Linux virtualization list
 help / color / mirror / Atom feed
* CFP ICETE 2019 - 16th Int.l Joint Conf. on e-Business and Telecommunications (Prague/Czech Republic)
From: icete @ 2018-12-27 23:05 UTC (permalink / raw)
  To: virtualization

SUBMISSION DEADLINE 

16th International Joint Conference on e-Business and Telecommunications

Submission Deadline: February 28, 2019

http://www.icete.org/

July 26 - 28, 2019
Prague, Czech Republic.

 


In Cooperation with: Photonics21 and EOS. 
                    
Proceedings will be submitted for indexation by: DBLP, Thomson Reuters, EI, SCOPUS, Semantic Scholar and Google Scholar. 
                    
 
A short list of presented papers will be selected so that revised and extended versions of these papers will be published by Springer.
 
All papers presented at the congress venue will also be available at the SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/).
  
Should you have any question please don't hesitate contacting me.
 

Kind regards,
ICETE Secretariat

Address: Av. D. Manuel I, 27A, 2º esq.
2910-595 Setubal, Portugal
Tel: +351 265 520 185
Fax: +351 265 520 186
Web: http://www.icete.org/
e-mail: icete.secretariat@insticc.org

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH v37 0/3] Virtio-balloon: support free page reporting
From: Christian Borntraeger @ 2018-12-27 12:17 UTC (permalink / raw)
  To: Wei Wang, virtio-dev, linux-kernel, virtualization, kvm, linux-mm,
	mst, mhocko, akpm, dgilbert
  Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, Cornelia Huck,
	Halil Pasic, pbonzini, nilal, torvalds
In-Reply-To: <de167161-b586-6aee-d6a5-a90d47bbe1d4@de.ibm.com>



On 27.12.2018 12:59, Christian Borntraeger wrote:
> On 27.12.2018 12:31, Christian Borntraeger wrote:
>> This patch triggers random crashes in the guest kernel on s390 early during boot.
>> No migration and no setting of the balloon is involved.
>>
> 
> Adding Conny and Halil,
> 
> As the QEMU provides no PAGE_HINT feature yet, this quick hack makes the
> guest boot fine again:
> 
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 728ecd1eea305..aa2e1864c5736 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -492,7 +492,7 @@ static int init_vqs(struct virtio_balloon *vb)
>                 callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>         }
>  
> -       err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
> +       err = vb->vdev->config->find_vqs(vb->vdev, 3, //VIRTIO_BALLOON_VQ_MAX,
>                                          vqs, callbacks, names, NULL, NULL);
>         if (err)
>                 return err;
> 
> 
> To me it looks like that virtio_ccw_find_vqs will abort if any of the virtqueues 
> that it is been asked for does not exist (including the earlier ones).
> 

This "hack" makes the random crashes go away, but the balloon interface itself
does not work. (setting the value to anything will hang the guest). 
As patch 1 also modifies the main path, there seem to be additional issues, maybe
endianess

Looking at things like

+		vb->cmd_id_received = VIRTIO_BALLOON_CMD_ID_STOP;
+		vb->cmd_id_active = cpu_to_virtio32(vb->vdev,
+						  VIRTIO_BALLOON_CMD_ID_STOP);
+		vb->cmd_id_stop = cpu_to_virtio32(vb->vdev,
+						  VIRTIO_BALLOON_CMD_ID_STOP);


Why is cmd_id_received not using cpu_to_virtio32?

^ permalink raw reply

* Re: [PATCH v37 1/3] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Christian Borntraeger @ 2018-12-27 12:03 UTC (permalink / raw)
  To: Wei Wang, virtio-dev, linux-kernel, virtualization, kvm, linux-mm,
	mst, mhocko, akpm, dgilbert
  Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, Cornelia Huck,
	Halil Pasic, pbonzini, nilal, torvalds
In-Reply-To: <1535333539-32420-2-git-send-email-wei.w.wang@intel.com>

On 27.08.2018 03:32, Wei Wang wrote:
>  static int init_vqs(struct virtio_balloon *vb)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> -	int err, nvqs;
> +	struct virtqueue *vqs[VIRTIO_BALLOON_VQ_MAX];
> +	vq_callback_t *callbacks[VIRTIO_BALLOON_VQ_MAX];
> +	const char *names[VIRTIO_BALLOON_VQ_MAX];
> +	int err;
> 
>  	/*
> -	 * We expect two virtqueues: inflate and deflate, and
> -	 * optionally stat.
> +	 * Inflateq and deflateq are used unconditionally. The names[]
> +	 * will be NULL if the related feature is not enabled, which will
> +	 * cause no allocation for the corresponding virtqueue in find_vqs.
>  	 */

This might be true for virtio-pci, but it is not for virtio-ccw.

> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> +	callbacks[VIRTIO_BALLOON_VQ_INFLATE] = balloon_ack;
> +	names[VIRTIO_BALLOON_VQ_INFLATE] = "inflate";
> +	callbacks[VIRTIO_BALLOON_VQ_DEFLATE] = balloon_ack;
> +	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
> +	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
> +	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> +		callbacks[VIRTIO_BALLOON_VQ_STATS] = stats_request;
> +	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
> +		names[VIRTIO_BALLOON_VQ_FREE_PAGE] = "free_page_vq";
> +		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> +	}
> +
> +	err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
[...]

^ permalink raw reply

* Re: [PATCH v37 0/3] Virtio-balloon: support free page reporting
From: Christian Borntraeger @ 2018-12-27 11:59 UTC (permalink / raw)
  To: Wei Wang, virtio-dev, linux-kernel, virtualization, kvm, linux-mm,
	mst, mhocko, akpm, dgilbert
  Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, Cornelia Huck,
	Halil Pasic, pbonzini, nilal, torvalds
In-Reply-To: <0661b05a-d9d0-d374-44e8-2583463e94c2@de.ibm.com>

On 27.12.2018 12:31, Christian Borntraeger wrote:
> This patch triggers random crashes in the guest kernel on s390 early during boot.
> No migration and no setting of the balloon is involved.
> 

Adding Conny and Halil,

As the QEMU provides no PAGE_HINT feature yet, this quick hack makes the
guest boot fine again:


diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 728ecd1eea305..aa2e1864c5736 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -492,7 +492,7 @@ static int init_vqs(struct virtio_balloon *vb)
                callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
        }
 
-       err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
+       err = vb->vdev->config->find_vqs(vb->vdev, 3, //VIRTIO_BALLOON_VQ_MAX,
                                         vqs, callbacks, names, NULL, NULL);
        if (err)
                return err;


To me it looks like that virtio_ccw_find_vqs will abort if any of the virtqueues 
that it is been asked for does not exist (including the earlier ones).


Christian

> 
> On 27.08.2018 03:32, Wei Wang wrote:
>> The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT, implemented by this
>> series enables the virtio-balloon driver to report hints of guest free
>> pages to host. It can be used to accelerate virtual machine (VM) live
>> migration. Here is an introduction of this usage:
>>
>> Live migration needs to transfer the VM's memory from the source machine
>> to the destination round by round. For the 1st round, all the VM's memory
>> is transferred. From the 2nd round, only the pieces of memory that were
>> written by the guest (after the 1st round) are transferred. One method
>> that is popularly used by the hypervisor to track which part of memory is
>> written is to have the hypervisor write-protect all the guest memory.
>>
>> This feature enables the optimization by skipping the transfer of guest
>> free pages during VM live migration. It is not concerned that the memory
>> pages are used after they are given to the hypervisor as a hint of the
>> free pages, because they will be tracked by the hypervisor and transferred
>> in the subsequent round if they are used and written.
>>
>> * Tests
>> 1 Test Environment
>>     Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
>>     Migration setup: migrate_set_speed 100G, migrate_set_downtime 400ms
>>
>> 2 Test Results (results are averaged over several repeated runs)
>>     2.1 Guest setup: 8G RAM, 4 vCPU
>>         2.1.1 Idle guest live migration time
>>             Optimization v.s. Legacy = 620ms vs 2970ms
>>             --> ~79% reduction
>>         2.1.2 Guest live migration with Linux compilation workload
>>           (i.e. make bzImage -j4) running
>>           1) Live Migration Time:
>>              Optimization v.s. Legacy = 2273ms v.s. 4502ms
>>              --> ~50% reduction
>>           2) Linux Compilation Time:
>>              Optimization v.s. Legacy = 8min42s v.s. 8min43s
>>              --> no obvious difference
>>
>>     2.2 Guest setup: 128G RAM, 4 vCPU
>>         2.2.1 Idle guest live migration time
>>             Optimization v.s. Legacy = 5294ms vs 41651ms
>>             --> ~87% reduction
>>         2.2.2 Guest live migration with Linux compilation workload
>>           1) Live Migration Time:
>>             Optimization v.s. Legacy = 8816ms v.s. 54201ms
>>             --> 84% reduction
>>           2) Linux Compilation Time:
>>              Optimization v.s. Legacy = 8min30s v.s. 8min36s
>>              --> no obvious difference
>>
>> ChangeLog:
>> v36->v37:
>>     - free the reported pages to mm when receives a DONE cmd from host.
>>       Please see patch 1's commit log for reasons. Please see patch 1's
>>       commit for detailed explanations.
>>
>> For ChangeLogs from v22 to v36, please reference
>> https://lkml.org/lkml/2018/7/20/199
>>
>> For ChangeLogs before v21, please reference
>> https://lwn.net/Articles/743660/
>>
>> Wei Wang (3):
>>   virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
>>   mm/page_poison: expose page_poisoning_enabled to kernel modules
>>   virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON
>>
>>  drivers/virtio/virtio_balloon.c     | 374 ++++++++++++++++++++++++++++++++----
>>  include/uapi/linux/virtio_balloon.h |   8 +
>>  mm/page_poison.c                    |   6 +
>>  3 files changed, 355 insertions(+), 33 deletions(-)
>>

^ permalink raw reply related

* Re: [PATCH v37 0/3] Virtio-balloon: support free page reporting
From: Christian Borntraeger @ 2018-12-27 11:31 UTC (permalink / raw)
  To: Wei Wang, virtio-dev, linux-kernel, virtualization, kvm, linux-mm,
	mst, mhocko, akpm, dgilbert
  Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, pbonzini,
	nilal, torvalds
In-Reply-To: <1535333539-32420-1-git-send-email-wei.w.wang@intel.com>

This patch triggers random crashes in the guest kernel on s390 early during boot.
No migration and no setting of the balloon is involved.




On 27.08.2018 03:32, Wei Wang wrote:
> The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT, implemented by this
> series enables the virtio-balloon driver to report hints of guest free
> pages to host. It can be used to accelerate virtual machine (VM) live
> migration. Here is an introduction of this usage:
> 
> Live migration needs to transfer the VM's memory from the source machine
> to the destination round by round. For the 1st round, all the VM's memory
> is transferred. From the 2nd round, only the pieces of memory that were
> written by the guest (after the 1st round) are transferred. One method
> that is popularly used by the hypervisor to track which part of memory is
> written is to have the hypervisor write-protect all the guest memory.
> 
> This feature enables the optimization by skipping the transfer of guest
> free pages during VM live migration. It is not concerned that the memory
> pages are used after they are given to the hypervisor as a hint of the
> free pages, because they will be tracked by the hypervisor and transferred
> in the subsequent round if they are used and written.
> 
> * Tests
> 1 Test Environment
>     Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
>     Migration setup: migrate_set_speed 100G, migrate_set_downtime 400ms
> 
> 2 Test Results (results are averaged over several repeated runs)
>     2.1 Guest setup: 8G RAM, 4 vCPU
>         2.1.1 Idle guest live migration time
>             Optimization v.s. Legacy = 620ms vs 2970ms
>             --> ~79% reduction
>         2.1.2 Guest live migration with Linux compilation workload
>           (i.e. make bzImage -j4) running
>           1) Live Migration Time:
>              Optimization v.s. Legacy = 2273ms v.s. 4502ms
>              --> ~50% reduction
>           2) Linux Compilation Time:
>              Optimization v.s. Legacy = 8min42s v.s. 8min43s
>              --> no obvious difference
> 
>     2.2 Guest setup: 128G RAM, 4 vCPU
>         2.2.1 Idle guest live migration time
>             Optimization v.s. Legacy = 5294ms vs 41651ms
>             --> ~87% reduction
>         2.2.2 Guest live migration with Linux compilation workload
>           1) Live Migration Time:
>             Optimization v.s. Legacy = 8816ms v.s. 54201ms
>             --> 84% reduction
>           2) Linux Compilation Time:
>              Optimization v.s. Legacy = 8min30s v.s. 8min36s
>              --> no obvious difference
> 
> ChangeLog:
> v36->v37:
>     - free the reported pages to mm when receives a DONE cmd from host.
>       Please see patch 1's commit log for reasons. Please see patch 1's
>       commit for detailed explanations.
> 
> For ChangeLogs from v22 to v36, please reference
> https://lkml.org/lkml/2018/7/20/199
> 
> For ChangeLogs before v21, please reference
> https://lwn.net/Articles/743660/
> 
> Wei Wang (3):
>   virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
>   mm/page_poison: expose page_poisoning_enabled to kernel modules
>   virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON
> 
>  drivers/virtio/virtio_balloon.c     | 374 ++++++++++++++++++++++++++++++++----
>  include/uapi/linux/virtio_balloon.h |   8 +
>  mm/page_poison.c                    |   6 +
>  3 files changed, 355 insertions(+), 33 deletions(-)
> 

^ permalink raw reply

* Re: [PATCH RFC 1/2] virtio-net: bql support
From: Jason Wang @ 2018-12-27 10:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, linux-kernel, virtualization, maxime.coquelin, wexu,
	David S. Miller
In-Reply-To: <20181226102100-mutt-send-email-mst@kernel.org>


On 2018/12/26 下午11:22, Michael S. Tsirkin wrote:
> On Thu, Dec 06, 2018 at 04:17:36PM +0800, Jason Wang wrote:
>> On 2018/12/6 上午6:54, Michael S. Tsirkin wrote:
>>> When use_napi is set, let's enable BQLs.  Note: some of the issues are
>>> similar to wifi.  It's worth considering whether something similar to
>>> commit 36148c2bbfbe ("mac80211: Adjust TSQ pacing shift") might be
>>> benefitial.
>>
>> I've played a similar patch several days before. The tricky part is the mode
>> switching between napi and no napi. We should make sure when the packet is
>> sent and trakced by BQL,  it should be consumed by BQL as well.
>
> I just went over the patch again and I don't understand this comment.
> This patch only enabled BQL with tx napi.
>
> Thus there's no mode switching.
>
> What did I miss?


Consider the case:


TX NAPI is disabled:

send N packets

turn TX NAPI on:

get tx interrupt

BQL try to consume those packets when lead WARN for dql.


Thanks



>
>
>> I did it by
>> tracking it through skb->cb.  And deal with the freeze by reset the BQL
>> status. Patch attached.
>>
>> But when testing with vhost-net, I don't very a stable performance, it was
>> probably because we batch the used ring updating so tx interrupt may come
>> randomly. We probably need to implement time bounded coalescing mechanism
>> which could be configured from userspace.
>>
>> Btw, maybe it's time just enable napi TX by default. I get ~10% TCP_RR
>> regression on machine without APICv, (haven't found time to test APICv
>> machine). But consider it was for correctness, I think it's acceptable? Then
>> we can do optimization on top?
>>
>>
>> Thanks
>>
>>
>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>> ---
>>>    drivers/net/virtio_net.c | 27 +++++++++++++++++++--------
>>>    1 file changed, 19 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index cecfd77c9f3c..b657bde6b94b 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -1325,7 +1325,8 @@ static int virtnet_receive(struct receive_queue *rq, int budget,
>>>    	return stats.packets;
>>>    }
>>> -static void free_old_xmit_skbs(struct send_queue *sq)
>>> +static void free_old_xmit_skbs(struct send_queue *sq, struct netdev_queue *txq,
>>> +			       bool use_napi)
>>>    {
>>>    	struct sk_buff *skb;
>>>    	unsigned int len;
>>> @@ -1347,6 +1348,9 @@ static void free_old_xmit_skbs(struct send_queue *sq)
>>>    	if (!packets)
>>>    		return;
>>> +	if (use_napi)
>>> +		netdev_tx_completed_queue(txq, packets, bytes);
>>> +
>>>    	u64_stats_update_begin(&sq->stats.syncp);
>>>    	sq->stats.bytes += bytes;
>>>    	sq->stats.packets += packets;
>>> @@ -1364,7 +1368,7 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
>>>    		return;
>>>    	if (__netif_tx_trylock(txq)) {
>>> -		free_old_xmit_skbs(sq);
>>> +		free_old_xmit_skbs(sq, txq, true);
>>>    		__netif_tx_unlock(txq);
>>>    	}
>>> @@ -1440,7 +1444,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
>>>    	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
>>>    	__netif_tx_lock(txq, raw_smp_processor_id());
>>> -	free_old_xmit_skbs(sq);
>>> +	free_old_xmit_skbs(sq, txq, true);
>>>    	__netif_tx_unlock(txq);
>>>    	virtqueue_napi_complete(napi, sq->vq, 0);
>>> @@ -1505,13 +1509,15 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>>>    	struct send_queue *sq = &vi->sq[qnum];
>>>    	int err;
>>>    	struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
>>> -	bool kick = !skb->xmit_more;
>>> +	bool more = skb->xmit_more;
>>>    	bool use_napi = sq->napi.weight;
>>> +	unsigned int bytes = skb->len;
>>> +	bool kick;
>>>    	/* Free up any pending old buffers before queueing new ones. */
>>> -	free_old_xmit_skbs(sq);
>>> +	free_old_xmit_skbs(sq, txq, use_napi);
>>> -	if (use_napi && kick)
>>> +	if (use_napi && !more)
>>>    		virtqueue_enable_cb_delayed(sq->vq);
>>>    	/* timestamp packet in software */
>>> @@ -1552,7 +1558,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>>>    		if (!use_napi &&
>>>    		    unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
>>>    			/* More just got used, free them then recheck. */
>>> -			free_old_xmit_skbs(sq);
>>> +			free_old_xmit_skbs(sq, txq, false);
>>>    			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
>>>    				netif_start_subqueue(dev, qnum);
>>>    				virtqueue_disable_cb(sq->vq);
>>> @@ -1560,7 +1566,12 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>>>    		}
>>>    	}
>>> -	if (kick || netif_xmit_stopped(txq)) {
>>> +	if (use_napi)
>>> +		kick = __netdev_tx_sent_queue(txq, bytes, more);
>>> +	else
>>> +		kick = !more || netif_xmit_stopped(txq);
>>> +
>>> +	if (kick) {
>>>    		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq)) {
>>>    			u64_stats_update_begin(&sq->stats.syncp);
>>>    			sq->stats.kicks++;
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH RFC 1/2] virtio-net: bql support
From: Jason Wang @ 2018-12-27 10:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, linux-kernel, virtualization, maxime.coquelin, wexu,
	David S. Miller
In-Reply-To: <20181226101528-mutt-send-email-mst@kernel.org>


On 2018/12/26 下午11:19, Michael S. Tsirkin wrote:
> On Thu, Dec 06, 2018 at 04:17:36PM +0800, Jason Wang wrote:
>> On 2018/12/6 上午6:54, Michael S. Tsirkin wrote:
>>> When use_napi is set, let's enable BQLs.  Note: some of the issues are
>>> similar to wifi.  It's worth considering whether something similar to
>>> commit 36148c2bbfbe ("mac80211: Adjust TSQ pacing shift") might be
>>> benefitial.
>>
>> I've played a similar patch several days before. The tricky part is the mode
>> switching between napi and no napi. We should make sure when the packet is
>> sent and trakced by BQL,  it should be consumed by BQL as well. I did it by
>> tracking it through skb->cb.  And deal with the freeze by reset the BQL
>> status. Patch attached.
>>
>> But when testing with vhost-net, I don't very a stable performance,
> So how about increasing TSQ pacing shift then?


I can test this. But changing default TCP value is much more than a 
virtio-net specific thing.


>
>> it was
>> probably because we batch the used ring updating so tx interrupt may come
>> randomly. We probably need to implement time bounded coalescing mechanism
>> which could be configured from userspace.
> I don't think it's reasonable to expect userspace to be that smart ...
> Why do we need time bounded? used ring is always updated when ring
> becomes empty.


We don't add used when means BQL may not see the consumed packet in 
time. And the delay varies based on the workload since we count packets 
not bytes or time before doing the batched updating.

Thanks


>
>> Btw, maybe it's time just enable napi TX by default. I get ~10% TCP_RR
>> regression on machine without APICv, (haven't found time to test APICv
>> machine). But consider it was for correctness, I think it's acceptable? Then
>> we can do optimization on top?
>>
>>
>> Thanks
>>
>>
>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>> ---
>>>    drivers/net/virtio_net.c | 27 +++++++++++++++++++--------
>>>    1 file changed, 19 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index cecfd77c9f3c..b657bde6b94b 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -1325,7 +1325,8 @@ static int virtnet_receive(struct receive_queue *rq, int budget,
>>>    	return stats.packets;
>>>    }
>>> -static void free_old_xmit_skbs(struct send_queue *sq)
>>> +static void free_old_xmit_skbs(struct send_queue *sq, struct netdev_queue *txq,
>>> +			       bool use_napi)
>>>    {
>>>    	struct sk_buff *skb;
>>>    	unsigned int len;
>>> @@ -1347,6 +1348,9 @@ static void free_old_xmit_skbs(struct send_queue *sq)
>>>    	if (!packets)
>>>    		return;
>>> +	if (use_napi)
>>> +		netdev_tx_completed_queue(txq, packets, bytes);
>>> +
>>>    	u64_stats_update_begin(&sq->stats.syncp);
>>>    	sq->stats.bytes += bytes;
>>>    	sq->stats.packets += packets;
>>> @@ -1364,7 +1368,7 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
>>>    		return;
>>>    	if (__netif_tx_trylock(txq)) {
>>> -		free_old_xmit_skbs(sq);
>>> +		free_old_xmit_skbs(sq, txq, true);
>>>    		__netif_tx_unlock(txq);
>>>    	}
>>> @@ -1440,7 +1444,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
>>>    	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
>>>    	__netif_tx_lock(txq, raw_smp_processor_id());
>>> -	free_old_xmit_skbs(sq);
>>> +	free_old_xmit_skbs(sq, txq, true);
>>>    	__netif_tx_unlock(txq);
>>>    	virtqueue_napi_complete(napi, sq->vq, 0);
>>> @@ -1505,13 +1509,15 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>>>    	struct send_queue *sq = &vi->sq[qnum];
>>>    	int err;
>>>    	struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
>>> -	bool kick = !skb->xmit_more;
>>> +	bool more = skb->xmit_more;
>>>    	bool use_napi = sq->napi.weight;
>>> +	unsigned int bytes = skb->len;
>>> +	bool kick;
>>>    	/* Free up any pending old buffers before queueing new ones. */
>>> -	free_old_xmit_skbs(sq);
>>> +	free_old_xmit_skbs(sq, txq, use_napi);
>>> -	if (use_napi && kick)
>>> +	if (use_napi && !more)
>>>    		virtqueue_enable_cb_delayed(sq->vq);
>>>    	/* timestamp packet in software */
>>> @@ -1552,7 +1558,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>>>    		if (!use_napi &&
>>>    		    unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
>>>    			/* More just got used, free them then recheck. */
>>> -			free_old_xmit_skbs(sq);
>>> +			free_old_xmit_skbs(sq, txq, false);
>>>    			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
>>>    				netif_start_subqueue(dev, qnum);
>>>    				virtqueue_disable_cb(sq->vq);
>>> @@ -1560,7 +1566,12 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>>>    		}
>>>    	}
>>> -	if (kick || netif_xmit_stopped(txq)) {
>>> +	if (use_napi)
>>> +		kick = __netdev_tx_sent_queue(txq, bytes, more);
>>> +	else
>>> +		kick = !more || netif_xmit_stopped(txq);
>>> +
>>> +	if (kick) {
>>>    		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq)) {
>>>    			u64_stats_update_begin(&sq->stats.syncp);
>>>    			sq->stats.kicks++;
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH RFC 1/2] virtio-net: bql support
From: Jason Wang @ 2018-12-27  9:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, linux-kernel, virtualization, maxime.coquelin, wexu,
	David S. Miller
In-Reply-To: <20181226101411-mutt-send-email-mst@kernel.org>


On 2018/12/26 下午11:15, Michael S. Tsirkin wrote:
> On Thu, Dec 06, 2018 at 04:17:36PM +0800, Jason Wang wrote:
>> On 2018/12/6 上午6:54, Michael S. Tsirkin wrote:
>>> When use_napi is set, let's enable BQLs.  Note: some of the issues are
>>> similar to wifi.  It's worth considering whether something similar to
>>> commit 36148c2bbfbe ("mac80211: Adjust TSQ pacing shift") might be
>>> benefitial.
>>
>> I've played a similar patch several days before. The tricky part is the mode
>> switching between napi and no napi. We should make sure when the packet is
>> sent and trakced by BQL,  it should be consumed by BQL as well. I did it by
>> tracking it through skb->cb.  And deal with the freeze by reset the BQL
>> status. Patch attached.
>>
>> But when testing with vhost-net, I don't very a stable performance, it was
>> probably because we batch the used ring updating so tx interrupt may come
>> randomly. We probably need to implement time bounded coalescing mechanism
>> which could be configured from userspace.
>>
>> Btw, maybe it's time just enable napi TX by default. I get ~10% TCP_RR
>> regression on machine without APICv, (haven't found time to test APICv
>> machine). But consider it was for correctness, I think it's acceptable? Then
>> we can do optimization on top?
>>
>>
>> Thanks
> I don't see how it's for correctness to be frank.


Socket accounting is wrong in the case. This should be a bug in fact.


> What if we just do the bulk free? Does that fix the regression?


I can test it.


>
>
>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>> ---
>>>    drivers/net/virtio_net.c | 27 +++++++++++++++++++--------
>>>    1 file changed, 19 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index cecfd77c9f3c..b657bde6b94b 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -1325,7 +1325,8 @@ static int virtnet_receive(struct receive_queue *rq, int budget,
>>>    	return stats.packets;
>>>    }
>>> -static void free_old_xmit_skbs(struct send_queue *sq)
>>> +static void free_old_xmit_skbs(struct send_queue *sq, struct netdev_queue *txq,
>>> +			       bool use_napi)
>>>    {
>>>    	struct sk_buff *skb;
>>>    	unsigned int len;
>>> @@ -1347,6 +1348,9 @@ static void free_old_xmit_skbs(struct send_queue *sq)
>>>    	if (!packets)
>>>    		return;
>>> +	if (use_napi)
>>> +		netdev_tx_completed_queue(txq, packets, bytes);
>>> +
>>>    	u64_stats_update_begin(&sq->stats.syncp);
>>>    	sq->stats.bytes += bytes;
>>>    	sq->stats.packets += packets;
>>> @@ -1364,7 +1368,7 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
>>>    		return;
>>>    	if (__netif_tx_trylock(txq)) {
>>> -		free_old_xmit_skbs(sq);
>>> +		free_old_xmit_skbs(sq, txq, true);
>>>    		__netif_tx_unlock(txq);
>>>    	}
>>> @@ -1440,7 +1444,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
>>>    	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
>>>    	__netif_tx_lock(txq, raw_smp_processor_id());
>>> -	free_old_xmit_skbs(sq);
>>> +	free_old_xmit_skbs(sq, txq, true);
>>>    	__netif_tx_unlock(txq);
>>>    	virtqueue_napi_complete(napi, sq->vq, 0);
>>> @@ -1505,13 +1509,15 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>>>    	struct send_queue *sq = &vi->sq[qnum];
>>>    	int err;
>>>    	struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
>>> -	bool kick = !skb->xmit_more;
>>> +	bool more = skb->xmit_more;
>>>    	bool use_napi = sq->napi.weight;
>>> +	unsigned int bytes = skb->len;
>>> +	bool kick;
>>>    	/* Free up any pending old buffers before queueing new ones. */
>>> -	free_old_xmit_skbs(sq);
>>> +	free_old_xmit_skbs(sq, txq, use_napi);
>>> -	if (use_napi && kick)
>>> +	if (use_napi && !more)
>>>    		virtqueue_enable_cb_delayed(sq->vq);
>>>    	/* timestamp packet in software */
>>> @@ -1552,7 +1558,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>>>    		if (!use_napi &&
>>>    		    unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
>>>    			/* More just got used, free them then recheck. */
>>> -			free_old_xmit_skbs(sq);
>>> +			free_old_xmit_skbs(sq, txq, false);
>>>    			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
>>>    				netif_start_subqueue(dev, qnum);
>>>    				virtqueue_disable_cb(sq->vq);
>>> @@ -1560,7 +1566,12 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>>>    		}
>>>    	}
>>> -	if (kick || netif_xmit_stopped(txq)) {
>>> +	if (use_napi)
>>> +		kick = __netdev_tx_sent_queue(txq, bytes, more);
>>> +	else
>>> +		kick = !more || netif_xmit_stopped(txq);
>>> +
>>> +	if (kick) {
>>>    		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq)) {
>>>    			u64_stats_update_begin(&sq->stats.syncp);
>>>    			sq->stats.kicks++;
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
From: Jason Wang @ 2018-12-27  9:39 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181226092431-mutt-send-email-mst@kernel.org>


On 2018/12/26 下午11:02, Michael S. Tsirkin wrote:
> On Wed, Dec 26, 2018 at 11:57:32AM +0800, Jason Wang wrote:
>> On 2018/12/25 下午8:50, Michael S. Tsirkin wrote:
>>> On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote:
>>>> On 2018/12/25 上午2:10, Michael S. Tsirkin wrote:
>>>>> On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
>>>>>> On 2018/12/14 下午8:36, Michael S. Tsirkin wrote:
>>>>>>> On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
>>>>>>>> On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
>>>>>>>>>> It was noticed that the copy_user() friends that was used to access
>>>>>>>>>> virtqueue metdata tends to be very expensive for dataplane
>>>>>>>>>> implementation like vhost since it involves lots of software check,
>>>>>>>>>> speculation barrier, hardware feature toggling (e.g SMAP). The
>>>>>>>>>> extra cost will be more obvious when transferring small packets.
>>>>>>>>>>
>>>>>>>>>> This patch tries to eliminate those overhead by pin vq metadata pages
>>>>>>>>>> and access them through vmap(). During SET_VRING_ADDR, we will setup
>>>>>>>>>> those mappings and memory accessors are modified to use pointers to
>>>>>>>>>> access the metadata directly.
>>>>>>>>>>
>>>>>>>>>> Note, this was only done when device IOTLB is not enabled. We could
>>>>>>>>>> use similar method to optimize it in the future.
>>>>>>>>>>
>>>>>>>>>> Tests shows about ~24% improvement on TX PPS when using virtio-user +
>>>>>>>>>> vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
>>>>>>>>>>
>>>>>>>>>> Before: ~5.0Mpps
>>>>>>>>>> After:  ~6.1Mpps
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>>>>>>>>>> ---
>>>>>>>>>>       drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>       drivers/vhost/vhost.h |  11 +++
>>>>>>>>>>       2 files changed, 189 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>>>>>>>> index bafe39d2e637..1bd24203afb6 100644
>>>>>>>>>> --- a/drivers/vhost/vhost.c
>>>>>>>>>> +++ b/drivers/vhost/vhost.c
>>>>>>>>>> @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>>>>>>>>>>       		vq->indirect = NULL;
>>>>>>>>>>       		vq->heads = NULL;
>>>>>>>>>>       		vq->dev = dev;
>>>>>>>>>> +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
>>>>>>>>>> +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
>>>>>>>>>> +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>>>>>>>>>>       		mutex_init(&vq->mutex);
>>>>>>>>>>       		vhost_vq_reset(dev, vq);
>>>>>>>>>>       		if (vq->handle_kick)
>>>>>>>>>> @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>>>>>>>>>>       	spin_unlock(&dev->iotlb_lock);
>>>>>>>>>>       }
>>>>>>>>>> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
>>>>>>>>>> +			   size_t size, int write)
>>>>>>>>>> +{
>>>>>>>>>> +	struct page **pages;
>>>>>>>>>> +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
>>>>>>>>>> +	int npinned;
>>>>>>>>>> +	void *vaddr;
>>>>>>>>>> +
>>>>>>>>>> +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
>>>>>>>>>> +	if (!pages)
>>>>>>>>>> +		return -ENOMEM;
>>>>>>>>>> +
>>>>>>>>>> +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
>>>>>>>>>> +	if (npinned != npages)
>>>>>>>>>> +		goto err;
>>>>>>>>>> +
>>>>>>>>> As I said I have doubts about the whole approach, but this
>>>>>>>>> implementation in particular isn't a good idea
>>>>>>>>> as it keeps the page around forever.
>>>>>> The pages wil be released during set features.
>>>>>>
>>>>>>
>>>>>>>>> So no THP, no NUMA rebalancing,
>>>>>> For THP, we will probably miss 2 or 4 pages, but does this really matter
>>>>>> consider the gain we have?
>>>>> We as in vhost? networking isn't the only thing guest does.
>>>>> We don't even know if this guest does a lot of networking.
>>>>> You don't
>>>>> know what else is in this huge page. Can be something very important
>>>>> that guest touches all the time.
>>>> Well, the probability should be very small consider we usually give several
>>>> gigabytes to guest. The rest of the pages that doesn't sit in the same
>>>> hugepage with metadata can still be merged by THP.  Anyway, I can test the
>>>> differences.
>>> Thanks!
>>>
>>>>>> For NUMA rebalancing, I'm even not quite sure if
>>>>>> it can helps for the case of IPC (vhost). It looks to me the worst case it
>>>>>> may cause page to be thrash between nodes if vhost and userspace are running
>>>>>> in two nodes.
>>>>> So again it's a gain for vhost but has a completely unpredictable effect on
>>>>> other functionality of the guest.
>>>>>
>>>>> That's what bothers me with this approach.
>>>> So:
>>>>
>>>> - The rest of the pages could still be balanced to other nodes, no?
>>>>
>>>> - try to balance metadata pages (belongs to co-operate processes) itself is
>>>> still questionable
>>> I am not sure why. It should be easy enough to force the VCPU and vhost
>>> to move (e.g. start them pinned to 1 cpu, then pin them to another one).
>>> Clearly sometimes this would be necessary for load balancing reasons.
>>
>> Yes, but it looks to me the part of motivation of auto NUMA is to avoid
>> manual pinning.
> ... of memory. Yes.
>
>
>>> With autonuma after a while (could take seconds but it will happen) the
>>> memory will migrate.
>>>
>> Yes. As you mentioned during the discuss, I wonder we could do it similarly
>> through mmu notifier like APIC access page in commit c24ae0dcd3e ("kvm: x86:
>> Unpin and remove kvm_arch->apic_access_page")
> That would be a possible approach.


Yes, this looks possible, and the conversion seems not hard. Let me have 
a try with this.


[...]


>>>>>>> I don't see how a kthread makes any difference. We do have a validation
>>>>>>> step which makes some difference.
>>>>>> The problem is not kthread but the address of userspace address. The
>>>>>> addresses of vq metadata tends to be consistent for a while, and vhost knows
>>>>>> they will be frequently. SMAP doesn't help too much in this case.
>>>>>>
>>>>>> Thanks.
>>>>> It's true for a real life applications but a malicious one
>>>>> can call the setup ioctls any number of times. And SMAP is
>>>>> all about malcious applications.
>>>> We don't do this in the path of ioctl, there's no context switch between
>>>> userspace and kernel in the worker thread. SMAP is used to prevent kernel
>>>> from accessing userspace pages unexpectedly which is not the case for
>>>> metadata access.
>>>>
>>>> Thanks
>>> OK let's forget smap for now.
>>
>> Some numbers I measured:
>>
>> On an old Sandy bridge machine without SMAP support. Remove speculation
>> barrier boost the performance from 4.6Mpps to 5.1Mpps
>>
>> On a newer Broadwell machine with SMAP support. Remove speculation barrier
>> only gives 2%-5% improvement, disable SMAP completely through Kconfig boost
>> 57% performance from 4.8Mpps to 7.5Mpps. (Vmap gives 6Mpps - 6.1Mpps, it
>> only bypass SMAP for metadata).
>>
>> So it looks like for recent machine, SMAP becomes pain point when the copy
>> is short (e.g 64B) for high PPS.
>>
>> Thanks
> Thanks a lot for looking into this!
>
> So first of all users can just boot with nosmap, right?
> What's wrong with that?


Nothing wrong, just realize we had this kernel parameter.


>   Yes it's not fine-grained but OTOH
> it's easy to understand.
>
> And I guess this confirms that if we are going to worry
> about smap enabled, we need to look into packet copies
> too, not just meta-data.


For packet copies, we can do batch copy which is pretty simple for the 
case of XDP. I've already had patches for this.


>
> Vaguely could see a module option (off by default)
> where vhost basically does user_access_begin
> when it starts running, then uses unsafe accesses
> in vhost and tun and then user_access_end.


Using user_access_begin() is more tricky than imaged. E.g it requires:

- userspace address to be validated before through access_ok() [1]

- It doesn't support calling a function that does explicit schedule 
since SMAP/PAN state is not maintained through schedule() [2]

[1] https://lwn.net/Articles/736348/

[2] https://lkml.org/lkml/2018/11/23/430

So calling user_access_begin() all the time when vhost is running seems 
pretty dangerous.

For a better batched datacopy, I tend to build not only XDP but also skb 
in vhost in the future.

Thanks


>
>
>>>>>>>> Packet or AF_XDP benefit from
>>>>>>>> accessing metadata directly, we should do it as well.
>>>>>>>>
>>>>>>>> Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net V2 4/4] vhost: log dirty page correctly
From: Jason Wang @ 2018-12-27  9:32 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Jintack Lim, netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181226083630-mutt-send-email-mst@kernel.org>


On 2018/12/26 下午9:46, Michael S. Tsirkin wrote:
> On Wed, Dec 26, 2018 at 01:43:26PM +0800, Jason Wang wrote:
>> On 2018/12/26 上午12:25, Michael S. Tsirkin wrote:
>>> On Tue, Dec 25, 2018 at 05:43:25PM +0800, Jason Wang wrote:
>>>> On 2018/12/25 上午1:41, Michael S. Tsirkin wrote:
>>>>> On Mon, Dec 24, 2018 at 11:43:31AM +0800, Jason Wang wrote:
>>>>>> On 2018/12/14 下午9:20, Michael S. Tsirkin wrote:
>>>>>>> On Fri, Dec 14, 2018 at 10:43:03AM +0800, Jason Wang wrote:
>>>>>>>> On 2018/12/13 下午10:31, Michael S. Tsirkin wrote:
>>>>>>>>>> Just to make sure I understand this. It looks to me we should:
>>>>>>>>>>
>>>>>>>>>> - allow passing GIOVA->GPA through UAPI
>>>>>>>>>>
>>>>>>>>>> - cache GIOVA->GPA somewhere but still use GIOVA->HVA in device IOTLB for
>>>>>>>>>> performance
>>>>>>>>>>
>>>>>>>>>> Is this what you suggest?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>> Not really. We already have GPA->HVA, so I suggested a flag to pass
>>>>>>>>> GIOVA->GPA in the IOTLB.
>>>>>>>>>
>>>>>>>>> This has advantages for security since a single table needs
>>>>>>>>> then to be validated to ensure guest does not corrupt
>>>>>>>>> QEMU memory.
>>>>>>>>>
>>>>>>>> I wonder how much we can gain through this. Currently, qemu IOMMU gives
>>>>>>>> GIOVA->GPA mapping, and qemu vhost code will translate GPA to HVA then pass
>>>>>>>> GIOVA->HVA to vhost. It looks no difference to me.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>> The difference is in security not in performance.  Getting a bad HVA
>>>>>>> corrupts QEMU memory and it might be guest controlled. Very risky.
>>>>>> How can this be controlled by guest? HVA was generated from qemu ram blocks
>>>>>> which is totally under the control of qemu memory core instead of guest.
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>> It is ultimately under guest influence as guest supplies IOVA->GPA
>>>>> translations.  qemu translates GPA->HVA and gives the translated result
>>>>> to the kernel.  If it's not buggy and kernel isn't buggy it's all
>>>>> fine.
>>>> If qemu provides buggy GPA->HVA, we can't workaround this. And I don't get
>>>> the point why we even want to try this. Buggy qemu code can crash itself in
>>>> many ways.
>>>>
>>>>
>>>>> But that's the approach that was proven not to work in the 20th century.
>>>>> In the 21st century we are trying defence in depth approach.
>>>>>
>>>>> My point is that a single code path that is responsible for
>>>>> the HVA translations is better than two.
>>>>>
>>>> So the difference whether or not use memory table information:
>>>>
>>>> Current:
>>>>
>>>> 1) SET_MEM_TABLE: GPA->HVA
>>>>
>>>> 2) Qemu GIOVA->GPA
>>>>
>>>> 3) Qemu GPA->HVA
>>>>
>>>> 4) IOTLB_UPDATE: GIOVA->HVA
>>>>
>>>> If I understand correctly you want to drop step 3 consider it might be buggy
>>>> which is just 19 lines of code in qemu (vhost_memory_region_lookup()). This
>>>> will ends up:
>>>>
>>>> 1) Do GPA->HVA translation in IOTLB_UPDATE path (I believe we won't want to
>>>> do it during device IOTLB lookup).
>>>>
>>>> 2) Extra bits to enable this capability.
>>>>
>>>> So this looks need more codes in kernel than what qemu did in userspace.  Is
>>>> this really worthwhile?
>>>>
>>>> Thanks
>>> So there are several points I would like to make
>>>
>>> 1. At the moment without an iommu it is possible to
>>>      change GPA-HVA mappings and everything keeps working
>>>      because a change in memory tables flushes the rings.
>>
>> Interesting, I don't know this before. But when can this happen?
>
> It doesn't happen with existing qemu. But it seems like a valid
> thing to do to remap memory at a different address.
>

Ok.


>>>      However I don't see the iotlb cache being invalidated
>>>      on that path - did I miss it? If it is not there it's
>>>      a related minor bug.
>>
>> It might have a bug. But a question is consider the case without IOMMU. We
>> only update mem table (SET_MEM_TABLE), but not vring address. This looks
>> like a bug as well?
> I think that without an iommu it can only work without races if backend is
> stopped or if the vring isn't in guest memory with ring aliasing).


Right.


>
>>> 2. qemu already has a GPA. Discarding it and re-calculating
>>>      when logging is on just seems wrong.
>>>      However if you would like to *also* keep the HVA in the iotlb
>>>      to avoid doing extra translations, that sounds like a
>>>      reasonable optimization.
>>
>> Yes, traverse GPA->HVA mapping seems unnecessary.
>>
>>
>>> 3. it also means that the hva->gpa translation only runs
>>>      when logging is enabled. That is a rarely excercised
>>>      path so any bugs there will not be caught.
>>
>> I wonder maybe some kind of unit-test may help here.
>>
>>
>>> So I really would like us long term to move away from
>>> hva->gpa translations, keep them for legacy userspace only
>>> but I don't really mind how we do it.
>>>
>>> How about
>>> - a new flag to pass an iotlb with *both* a gpa and hva
>>> - for legacy userspace, calculate the gpa on iotlb update
>>>     so the device then uses a shared code path
>>>
>>> what do you think?
>>>
>>>
>> I don't object this idea so I can try, just want to figure out why it was a
>> must.
>>
>> Thanks
> Not a must but I think it's a good interface extension.
>

Ok. let me try to do this.

Thanks

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: PROPOSAL: Extend inline asm syntax with size spec
From: Masahiro Yamada @ 2018-12-27  4:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kate Stewart, Christopher Li, virtualization, Max Filippov,
	Nadav Amit, Jan Beulich, H. Peter Anvin, Sam Ravnborg,
	Ingo Molnar, X86 ML, linux-sparse, Ingo Molnar, linux-xtensa,
	Kees Cook, Segher Boessenkool, Chris Zankel, Michael Matz,
	Borislav Petkov, Josh Poimboeuf, Alok Kataria, Juergen Gross, gcc,
	Richard Biener, Greg Kroah-Hartman, Linux Kernel Mailing List
In-Reply-To: <20181031125526.GA13219@hirez.programming.kicks-ass.net>

Hi Peter,


On Wed, Oct 31, 2018 at 9:58 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sat, Oct 13, 2018 at 09:33:35PM +0200, Borislav Petkov wrote:
> > Ok,
> >
> > with Segher's help I've been playing with his patch ontop of bleeding
> > edge gcc 9 and here are my observations. Please double-check me for
> > booboos so that they can be addressed while there's time.
> >
> > So here's what I see ontop of 4.19-rc7:
> >
> > First marked the alternative asm() as inline and undeffed the "inline"
> > keyword. I need to do that because of the funky games we do with
> > "inline" redefinitions in include/linux/compiler_types.h.
> >
> > And Segher hinted at either doing:
> >
> > asm volatile inline(...
> >
> > or
> >
> > asm volatile __inline__(...
> >
> > but both "inline" variants are defined as macros in that file.
> >
> > Which means we either need to #undef inline before using it in asm() or
> > come up with something cleverer.
>
> # git grep -e "\<__inline__\>" | wc -l
> 488
> # git grep -e "\<__inline\>" | wc -l
> 56
> # git grep -e "\<inline\>" | wc -l
> 69598
>
> And we already have scripts/checkpatch.pl:
>
>   # Check for __inline__ and __inline, prefer inline
>
> Which suggests we do:
>
> git grep -l "\<__inline\(\|__\)\>" | while read file
> do
>         sed -i -e 's/\<__inline\(\|__\)\>/inline/g' $file
> done
>
> and get it over with.


Do you have a plan to really do this?

This is a nice cleanup anyway.

I think the last minute of MW is
a good timing for the global replacement like this.




>
> Anyway, with the below patch, I get:
>
>    text    data     bss     dec     hex filename
> 17385183        5064780 1953892 24403855        1745f8f defconfig-build/vmlinux
> 17385678        5064780 1953892 24404350        174617e defconfig-build/vmlinux
>
> Which shows we inline more (look for asm_volatile for the actual
> changes).
>
>
> So yes, this seems like something we could work with.
>
> ---
>  Documentation/trace/tracepoint-analysis.rst        |  2 +-
>  Documentation/translations/ja_JP/SubmittingPatches |  4 +--
>  Documentation/translations/zh_CN/SubmittingPatches |  4 +--
>  arch/alpha/include/asm/atomic.h                    | 12 +++----
>  arch/alpha/include/asm/bitops.h                    |  6 ++--
>  arch/alpha/include/asm/compiler.h                  |  4 +--
>  arch/alpha/include/asm/dma.h                       | 22 ++++++------
>  arch/alpha/include/asm/floppy.h                    |  4 +--
>  arch/alpha/include/asm/irq.h                       |  2 +-
>  arch/alpha/include/asm/local.h                     |  4 +--
>  arch/alpha/include/asm/smp.h                       |  2 +-
>  arch/arm/mach-iop32x/include/mach/uncompress.h     |  2 +-
>  arch/arm/mach-iop33x/include/mach/uncompress.h     |  2 +-
>  arch/arm/mach-ixp4xx/include/mach/uncompress.h     |  2 +-
>  arch/ia64/hp/common/sba_iommu.c                    |  2 +-
>  arch/ia64/hp/sim/simeth.c                          |  2 +-
>  arch/ia64/include/asm/atomic.h                     |  8 ++---
>  arch/ia64/include/asm/bitops.h                     | 34 +++++++++---------
>  arch/ia64/include/asm/delay.h                      | 14 ++++----
>  arch/ia64/include/asm/irq.h                        |  2 +-
>  arch/ia64/include/asm/page.h                       |  2 +-
>  arch/ia64/include/asm/sn/leds.h                    |  2 +-
>  arch/ia64/include/asm/uaccess.h                    |  4 +--
>  arch/ia64/include/uapi/asm/rse.h                   | 12 +++----
>  arch/ia64/include/uapi/asm/swab.h                  |  6 ++--
>  arch/ia64/oprofile/backtrace.c                     |  4 +--
>  arch/m68k/include/asm/blinken.h                    |  2 +-
>  arch/m68k/include/asm/checksum.h                   |  2 +-
>  arch/m68k/include/asm/dma.h                        | 32 ++++++++---------
>  arch/m68k/include/asm/floppy.h                     |  8 ++---
>  arch/m68k/include/asm/nettel.h                     |  8 ++---
>  arch/m68k/mac/iop.c                                | 14 ++++----
>  arch/mips/include/asm/atomic.h                     | 16 ++++-----
>  arch/mips/include/asm/checksum.h                   |  2 +-
>  arch/mips/include/asm/dma.h                        | 20 +++++------
>  arch/mips/include/asm/jazz.h                       |  2 +-
>  arch/mips/include/asm/local.h                      |  4 +--
>  arch/mips/include/asm/string.h                     |  8 ++---
>  arch/mips/kernel/binfmt_elfn32.c                   |  2 +-
>  arch/nds32/include/asm/swab.h                      |  4 +--
>  arch/parisc/include/asm/atomic.h                   | 20 +++++------
>  arch/parisc/include/asm/bitops.h                   | 18 +++++-----
>  arch/parisc/include/asm/checksum.h                 |  4 +--
>  arch/parisc/include/asm/compat.h                   |  2 +-
>  arch/parisc/include/asm/delay.h                    |  2 +-
>  arch/parisc/include/asm/dma.h                      | 20 +++++------
>  arch/parisc/include/asm/ide.h                      |  8 ++---
>  arch/parisc/include/asm/irq.h                      |  2 +-
>  arch/parisc/include/asm/spinlock.h                 | 12 +++----
>  arch/powerpc/include/asm/atomic.h                  | 40 +++++++++++-----------
>  arch/powerpc/include/asm/bitops.h                  | 28 +++++++--------
>  arch/powerpc/include/asm/dma.h                     | 20 +++++------
>  arch/powerpc/include/asm/edac.h                    |  2 +-
>  arch/powerpc/include/asm/irq.h                     |  2 +-
>  arch/powerpc/include/asm/local.h                   | 14 ++++----
>  arch/sh/include/asm/pgtable_64.h                   |  2 +-
>  arch/sh/include/asm/processor_32.h                 |  4 +--
>  arch/sh/include/cpu-sh3/cpu/dac.h                  |  6 ++--
>  arch/x86/include/asm/alternative.h                 | 14 ++++----
>  arch/x86/um/asm/checksum.h                         |  4 +--
>  arch/x86/um/asm/checksum_32.h                      |  4 +--
>  arch/xtensa/include/asm/checksum.h                 | 14 ++++----
>  arch/xtensa/include/asm/cmpxchg.h                  |  4 +--
>  arch/xtensa/include/asm/irq.h                      |  2 +-
>  block/partitions/amiga.c                           |  2 +-
>  drivers/atm/he.c                                   |  6 ++--
>  drivers/atm/idt77252.c                             |  6 ++--
>  drivers/gpu/drm/mga/mga_drv.h                      |  2 +-
>  drivers/gpu/drm/mga/mga_state.c                    | 14 ++++----
>  drivers/gpu/drm/r128/r128_drv.h                    |  2 +-
>  drivers/gpu/drm/r128/r128_state.c                  | 14 ++++----
>  drivers/gpu/drm/via/via_irq.c                      |  2 +-
>  drivers/gpu/drm/via/via_verifier.c                 | 30 ++++++++--------
>  drivers/isdn/hardware/eicon/platform.h             | 14 ++++----
>  drivers/isdn/i4l/isdn_net.c                        | 14 ++++----
>  drivers/isdn/i4l/isdn_net.h                        |  8 ++---
>  drivers/media/pci/ivtv/ivtv-ioctl.c                |  2 +-
>  drivers/net/ethernet/sun/sungem.c                  |  8 ++---
>  drivers/net/ethernet/sun/sunhme.c                  |  6 ++--
>  drivers/net/hamradio/baycom_ser_fdx.c              |  2 +-
>  drivers/net/wan/lapbether.c                        |  2 +-
>  drivers/net/wan/n2.c                               |  4 +--
>  drivers/parisc/led.c                               |  4 +--
>  drivers/parisc/sba_iommu.c                         |  2 +-
>  drivers/parport/parport_gsc.c                      |  2 +-
>  drivers/parport/parport_gsc.h                      |  4 +--
>  drivers/parport/parport_pc.c                       |  2 +-
>  drivers/scsi/lpfc/lpfc_scsi.c                      |  2 +-
>  drivers/scsi/pcmcia/sym53c500_cs.c                 |  4 +--
>  drivers/scsi/qla2xxx/qla_inline.h                  |  2 +-
>  drivers/scsi/qla2xxx/qla_os.c                      |  4 +--
>  drivers/staging/rtl8723bs/core/rtw_pwrctrl.c       |  4 +--
>  drivers/staging/rtl8723bs/core/rtw_wlan_util.c     |  2 +-
>  drivers/staging/rtl8723bs/include/drv_types.h      |  6 ++--
>  drivers/staging/rtl8723bs/include/ieee80211.h      |  6 ++--
>  drivers/staging/rtl8723bs/include/osdep_service.h  | 10 +++---
>  .../rtl8723bs/include/osdep_service_linux.h        | 14 ++++----
>  drivers/staging/rtl8723bs/include/rtw_mlme.h       | 14 ++++----
>  drivers/staging/rtl8723bs/include/rtw_recv.h       | 16 ++++-----
>  drivers/staging/rtl8723bs/include/sta_info.h       |  2 +-
>  drivers/staging/rtl8723bs/include/wifi.h           | 14 ++++----
>  drivers/staging/rtl8723bs/include/wlan_bssdef.h    |  2 +-
>  drivers/tty/amiserial.c                            |  2 +-
>  drivers/tty/serial/ip22zilog.c                     |  2 +-
>  drivers/tty/serial/sunsab.c                        |  4 +--
>  drivers/tty/serial/sunzilog.c                      |  2 +-
>  drivers/video/fbdev/core/fbcon.c                   | 20 +++++------
>  drivers/video/fbdev/ffb.c                          |  2 +-
>  drivers/video/fbdev/intelfb/intelfbdrv.c           | 10 +++---
>  drivers/video/fbdev/intelfb/intelfbhw.c            |  2 +-
>  drivers/w1/masters/matrox_w1.c                     |  4 +--
>  fs/coda/coda_linux.h                               |  6 ++--
>  fs/freevxfs/vxfs_inode.c                           |  2 +-
>  fs/nfsd/nfsfh.h                                    |  4 +--
>  include/acpi/platform/acgcc.h                      |  2 +-
>  include/acpi/platform/acintel.h                    |  2 +-
>  include/asm-generic/ide_iops.h                     |  8 ++---
>  include/linux/atalk.h                              |  4 +--
>  include/linux/ceph/messenger.h                     |  2 +-
>  include/linux/compiler_types.h                     |  4 +--
>  include/linux/hdlc.h                               |  4 +--
>  include/linux/inetdevice.h                         |  8 ++---
>  include/linux/parport.h                            |  4 +--
>  include/linux/parport_pc.h                         | 22 ++++++------
>  include/net/ax25.h                                 |  2 +-
>  include/net/checksum.h                             |  2 +-
>  include/net/dn_nsp.h                               | 16 ++++-----
>  include/net/ip.h                                   |  2 +-
>  include/net/ip6_checksum.h                         |  2 +-
>  include/net/ipx.h                                  | 10 +++---
>  include/net/llc_c_ev.h                             |  4 +--
>  include/net/llc_conn.h                             |  4 +--
>  include/net/llc_s_ev.h                             |  2 +-
>  include/net/netrom.h                               |  8 ++---
>  include/net/scm.h                                  | 14 ++++----
>  include/net/udplite.h                              |  2 +-
>  include/net/x25.h                                  |  8 ++---
>  include/net/xfrm.h                                 | 18 +++++-----
>  include/uapi/linux/atm.h                           |  4 +--
>  include/uapi/linux/atmsap.h                        |  2 +-
>  include/uapi/linux/map_to_7segment.h               |  2 +-
>  include/uapi/linux/netfilter_arp/arp_tables.h      |  2 +-
>  include/uapi/linux/netfilter_bridge/ebtables.h     |  2 +-
>  include/uapi/linux/netfilter_ipv4/ip_tables.h      |  2 +-
>  include/uapi/linux/netfilter_ipv6/ip6_tables.h     |  2 +-
>  include/video/newport.h                            | 12 +++----
>  lib/zstd/mem.h                                     |  2 +-
>  net/appletalk/atalk_proc.c                         |  4 +--
>  net/appletalk/ddp.c                                |  2 +-
>  net/core/neighbour.c                               |  2 +-
>  net/core/scm.c                                     |  2 +-
>  net/decnet/dn_nsp_in.c                             |  2 +-
>  net/decnet/dn_nsp_out.c                            |  2 +-
>  net/decnet/dn_route.c                              |  2 +-
>  net/decnet/dn_table.c                              |  4 +--
>  net/ipv4/igmp.c                                    |  2 +-
>  net/ipv6/af_inet6.c                                |  2 +-
>  net/ipv6/icmp.c                                    |  4 +--
>  net/ipv6/udp.c                                     |  4 +--
>  net/lapb/lapb_iface.c                              |  4 +--
>  net/llc/llc_input.c                                |  2 +-
>  scripts/checkpatch.pl                              |  8 ++---
>  scripts/genksyms/keywords.c                        |  4 +--
>  scripts/kernel-doc                                 |  4 +--
>  sound/sparc/amd7930.c                              |  6 ++--
>  165 files changed, 547 insertions(+), 547 deletions(-)


-- 
Best Regards
Masahiro Yamada

^ permalink raw reply

* Re: [PATCH RFC 1/2] virtio-net: bql support
From: Michael S. Tsirkin @ 2018-12-26 15:22 UTC (permalink / raw)
  To: Jason Wang
  Cc: netdev, linux-kernel, virtualization, maxime.coquelin, wexu,
	David S. Miller
In-Reply-To: <21384cb5-99a6-7431-1039-b356521e1bc3@redhat.com>

On Thu, Dec 06, 2018 at 04:17:36PM +0800, Jason Wang wrote:
> 
> On 2018/12/6 上午6:54, Michael S. Tsirkin wrote:
> > When use_napi is set, let's enable BQLs.  Note: some of the issues are
> > similar to wifi.  It's worth considering whether something similar to
> > commit 36148c2bbfbe ("mac80211: Adjust TSQ pacing shift") might be
> > benefitial.
> 
> 
> I've played a similar patch several days before. The tricky part is the mode
> switching between napi and no napi. We should make sure when the packet is
> sent and trakced by BQL,  it should be consumed by BQL as well.


I just went over the patch again and I don't understand this comment.
This patch only enabled BQL with tx napi.

Thus there's no mode switching.

What did I miss?


> I did it by
> tracking it through skb->cb.  And deal with the freeze by reset the BQL
> status. Patch attached.
> 
> But when testing with vhost-net, I don't very a stable performance, it was
> probably because we batch the used ring updating so tx interrupt may come
> randomly. We probably need to implement time bounded coalescing mechanism
> which could be configured from userspace.
> 
> Btw, maybe it's time just enable napi TX by default. I get ~10% TCP_RR
> regression on machine without APICv, (haven't found time to test APICv
> machine). But consider it was for correctness, I think it's acceptable? Then
> we can do optimization on top?
> 
> 
> Thanks
> 
> 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >   drivers/net/virtio_net.c | 27 +++++++++++++++++++--------
> >   1 file changed, 19 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index cecfd77c9f3c..b657bde6b94b 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -1325,7 +1325,8 @@ static int virtnet_receive(struct receive_queue *rq, int budget,
> >   	return stats.packets;
> >   }
> > -static void free_old_xmit_skbs(struct send_queue *sq)
> > +static void free_old_xmit_skbs(struct send_queue *sq, struct netdev_queue *txq,
> > +			       bool use_napi)
> >   {
> >   	struct sk_buff *skb;
> >   	unsigned int len;
> > @@ -1347,6 +1348,9 @@ static void free_old_xmit_skbs(struct send_queue *sq)
> >   	if (!packets)
> >   		return;
> > +	if (use_napi)
> > +		netdev_tx_completed_queue(txq, packets, bytes);
> > +
> >   	u64_stats_update_begin(&sq->stats.syncp);
> >   	sq->stats.bytes += bytes;
> >   	sq->stats.packets += packets;
> > @@ -1364,7 +1368,7 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
> >   		return;
> >   	if (__netif_tx_trylock(txq)) {
> > -		free_old_xmit_skbs(sq);
> > +		free_old_xmit_skbs(sq, txq, true);
> >   		__netif_tx_unlock(txq);
> >   	}
> > @@ -1440,7 +1444,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
> >   	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
> >   	__netif_tx_lock(txq, raw_smp_processor_id());
> > -	free_old_xmit_skbs(sq);
> > +	free_old_xmit_skbs(sq, txq, true);
> >   	__netif_tx_unlock(txq);
> >   	virtqueue_napi_complete(napi, sq->vq, 0);
> > @@ -1505,13 +1509,15 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> >   	struct send_queue *sq = &vi->sq[qnum];
> >   	int err;
> >   	struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
> > -	bool kick = !skb->xmit_more;
> > +	bool more = skb->xmit_more;
> >   	bool use_napi = sq->napi.weight;
> > +	unsigned int bytes = skb->len;
> > +	bool kick;
> >   	/* Free up any pending old buffers before queueing new ones. */
> > -	free_old_xmit_skbs(sq);
> > +	free_old_xmit_skbs(sq, txq, use_napi);
> > -	if (use_napi && kick)
> > +	if (use_napi && !more)
> >   		virtqueue_enable_cb_delayed(sq->vq);
> >   	/* timestamp packet in software */
> > @@ -1552,7 +1558,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> >   		if (!use_napi &&
> >   		    unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
> >   			/* More just got used, free them then recheck. */
> > -			free_old_xmit_skbs(sq);
> > +			free_old_xmit_skbs(sq, txq, false);
> >   			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
> >   				netif_start_subqueue(dev, qnum);
> >   				virtqueue_disable_cb(sq->vq);
> > @@ -1560,7 +1566,12 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> >   		}
> >   	}
> > -	if (kick || netif_xmit_stopped(txq)) {
> > +	if (use_napi)
> > +		kick = __netdev_tx_sent_queue(txq, bytes, more);
> > +	else
> > +		kick = !more || netif_xmit_stopped(txq);
> > +
> > +	if (kick) {
> >   		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq)) {
> >   			u64_stats_update_begin(&sq->stats.syncp);
> >   			sq->stats.kicks++;
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH RFC 1/2] virtio-net: bql support
From: Michael S. Tsirkin @ 2018-12-26 15:19 UTC (permalink / raw)
  To: Jason Wang
  Cc: netdev, linux-kernel, virtualization, maxime.coquelin, wexu,
	David S. Miller
In-Reply-To: <21384cb5-99a6-7431-1039-b356521e1bc3@redhat.com>

On Thu, Dec 06, 2018 at 04:17:36PM +0800, Jason Wang wrote:
> 
> On 2018/12/6 上午6:54, Michael S. Tsirkin wrote:
> > When use_napi is set, let's enable BQLs.  Note: some of the issues are
> > similar to wifi.  It's worth considering whether something similar to
> > commit 36148c2bbfbe ("mac80211: Adjust TSQ pacing shift") might be
> > benefitial.
> 
> 
> I've played a similar patch several days before. The tricky part is the mode
> switching between napi and no napi. We should make sure when the packet is
> sent and trakced by BQL,  it should be consumed by BQL as well. I did it by
> tracking it through skb->cb.  And deal with the freeze by reset the BQL
> status. Patch attached.
> 
> But when testing with vhost-net, I don't very a stable performance,

So how about increasing TSQ pacing shift then?

> it was
> probably because we batch the used ring updating so tx interrupt may come
> randomly. We probably need to implement time bounded coalescing mechanism
> which could be configured from userspace.

I don't think it's reasonable to expect userspace to be that smart ...
Why do we need time bounded? used ring is always updated when ring
becomes empty.

> Btw, maybe it's time just enable napi TX by default. I get ~10% TCP_RR
> regression on machine without APICv, (haven't found time to test APICv
> machine). But consider it was for correctness, I think it's acceptable? Then
> we can do optimization on top?
> 
> 
> Thanks
> 
> 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >   drivers/net/virtio_net.c | 27 +++++++++++++++++++--------
> >   1 file changed, 19 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index cecfd77c9f3c..b657bde6b94b 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -1325,7 +1325,8 @@ static int virtnet_receive(struct receive_queue *rq, int budget,
> >   	return stats.packets;
> >   }
> > -static void free_old_xmit_skbs(struct send_queue *sq)
> > +static void free_old_xmit_skbs(struct send_queue *sq, struct netdev_queue *txq,
> > +			       bool use_napi)
> >   {
> >   	struct sk_buff *skb;
> >   	unsigned int len;
> > @@ -1347,6 +1348,9 @@ static void free_old_xmit_skbs(struct send_queue *sq)
> >   	if (!packets)
> >   		return;
> > +	if (use_napi)
> > +		netdev_tx_completed_queue(txq, packets, bytes);
> > +
> >   	u64_stats_update_begin(&sq->stats.syncp);
> >   	sq->stats.bytes += bytes;
> >   	sq->stats.packets += packets;
> > @@ -1364,7 +1368,7 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
> >   		return;
> >   	if (__netif_tx_trylock(txq)) {
> > -		free_old_xmit_skbs(sq);
> > +		free_old_xmit_skbs(sq, txq, true);
> >   		__netif_tx_unlock(txq);
> >   	}
> > @@ -1440,7 +1444,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
> >   	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
> >   	__netif_tx_lock(txq, raw_smp_processor_id());
> > -	free_old_xmit_skbs(sq);
> > +	free_old_xmit_skbs(sq, txq, true);
> >   	__netif_tx_unlock(txq);
> >   	virtqueue_napi_complete(napi, sq->vq, 0);
> > @@ -1505,13 +1509,15 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> >   	struct send_queue *sq = &vi->sq[qnum];
> >   	int err;
> >   	struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
> > -	bool kick = !skb->xmit_more;
> > +	bool more = skb->xmit_more;
> >   	bool use_napi = sq->napi.weight;
> > +	unsigned int bytes = skb->len;
> > +	bool kick;
> >   	/* Free up any pending old buffers before queueing new ones. */
> > -	free_old_xmit_skbs(sq);
> > +	free_old_xmit_skbs(sq, txq, use_napi);
> > -	if (use_napi && kick)
> > +	if (use_napi && !more)
> >   		virtqueue_enable_cb_delayed(sq->vq);
> >   	/* timestamp packet in software */
> > @@ -1552,7 +1558,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> >   		if (!use_napi &&
> >   		    unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
> >   			/* More just got used, free them then recheck. */
> > -			free_old_xmit_skbs(sq);
> > +			free_old_xmit_skbs(sq, txq, false);
> >   			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
> >   				netif_start_subqueue(dev, qnum);
> >   				virtqueue_disable_cb(sq->vq);
> > @@ -1560,7 +1566,12 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> >   		}
> >   	}
> > -	if (kick || netif_xmit_stopped(txq)) {
> > +	if (use_napi)
> > +		kick = __netdev_tx_sent_queue(txq, bytes, more);
> > +	else
> > +		kick = !more || netif_xmit_stopped(txq);
> > +
> > +	if (kick) {
> >   		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq)) {
> >   			u64_stats_update_begin(&sq->stats.syncp);
> >   			sq->stats.kicks++;
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH RFC 1/2] virtio-net: bql support
From: Michael S. Tsirkin @ 2018-12-26 15:15 UTC (permalink / raw)
  To: Jason Wang
  Cc: netdev, linux-kernel, virtualization, maxime.coquelin, wexu,
	David S. Miller
In-Reply-To: <21384cb5-99a6-7431-1039-b356521e1bc3@redhat.com>

On Thu, Dec 06, 2018 at 04:17:36PM +0800, Jason Wang wrote:
> 
> On 2018/12/6 上午6:54, Michael S. Tsirkin wrote:
> > When use_napi is set, let's enable BQLs.  Note: some of the issues are
> > similar to wifi.  It's worth considering whether something similar to
> > commit 36148c2bbfbe ("mac80211: Adjust TSQ pacing shift") might be
> > benefitial.
> 
> 
> I've played a similar patch several days before. The tricky part is the mode
> switching between napi and no napi. We should make sure when the packet is
> sent and trakced by BQL,  it should be consumed by BQL as well. I did it by
> tracking it through skb->cb.  And deal with the freeze by reset the BQL
> status. Patch attached.
> 
> But when testing with vhost-net, I don't very a stable performance, it was
> probably because we batch the used ring updating so tx interrupt may come
> randomly. We probably need to implement time bounded coalescing mechanism
> which could be configured from userspace.
> 
> Btw, maybe it's time just enable napi TX by default. I get ~10% TCP_RR
> regression on machine without APICv, (haven't found time to test APICv
> machine). But consider it was for correctness, I think it's acceptable? Then
> we can do optimization on top?
> 
> 
> Thanks

I don't see how it's for correctness to be frank.
What if we just do the bulk free? Does that fix the regression?


> 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >   drivers/net/virtio_net.c | 27 +++++++++++++++++++--------
> >   1 file changed, 19 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index cecfd77c9f3c..b657bde6b94b 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -1325,7 +1325,8 @@ static int virtnet_receive(struct receive_queue *rq, int budget,
> >   	return stats.packets;
> >   }
> > -static void free_old_xmit_skbs(struct send_queue *sq)
> > +static void free_old_xmit_skbs(struct send_queue *sq, struct netdev_queue *txq,
> > +			       bool use_napi)
> >   {
> >   	struct sk_buff *skb;
> >   	unsigned int len;
> > @@ -1347,6 +1348,9 @@ static void free_old_xmit_skbs(struct send_queue *sq)
> >   	if (!packets)
> >   		return;
> > +	if (use_napi)
> > +		netdev_tx_completed_queue(txq, packets, bytes);
> > +
> >   	u64_stats_update_begin(&sq->stats.syncp);
> >   	sq->stats.bytes += bytes;
> >   	sq->stats.packets += packets;
> > @@ -1364,7 +1368,7 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
> >   		return;
> >   	if (__netif_tx_trylock(txq)) {
> > -		free_old_xmit_skbs(sq);
> > +		free_old_xmit_skbs(sq, txq, true);
> >   		__netif_tx_unlock(txq);
> >   	}
> > @@ -1440,7 +1444,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
> >   	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
> >   	__netif_tx_lock(txq, raw_smp_processor_id());
> > -	free_old_xmit_skbs(sq);
> > +	free_old_xmit_skbs(sq, txq, true);
> >   	__netif_tx_unlock(txq);
> >   	virtqueue_napi_complete(napi, sq->vq, 0);
> > @@ -1505,13 +1509,15 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> >   	struct send_queue *sq = &vi->sq[qnum];
> >   	int err;
> >   	struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
> > -	bool kick = !skb->xmit_more;
> > +	bool more = skb->xmit_more;
> >   	bool use_napi = sq->napi.weight;
> > +	unsigned int bytes = skb->len;
> > +	bool kick;
> >   	/* Free up any pending old buffers before queueing new ones. */
> > -	free_old_xmit_skbs(sq);
> > +	free_old_xmit_skbs(sq, txq, use_napi);
> > -	if (use_napi && kick)
> > +	if (use_napi && !more)
> >   		virtqueue_enable_cb_delayed(sq->vq);
> >   	/* timestamp packet in software */
> > @@ -1552,7 +1558,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> >   		if (!use_napi &&
> >   		    unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
> >   			/* More just got used, free them then recheck. */
> > -			free_old_xmit_skbs(sq);
> > +			free_old_xmit_skbs(sq, txq, false);
> >   			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
> >   				netif_start_subqueue(dev, qnum);
> >   				virtqueue_disable_cb(sq->vq);
> > @@ -1560,7 +1566,12 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> >   		}
> >   	}
> > -	if (kick || netif_xmit_stopped(txq)) {
> > +	if (use_napi)
> > +		kick = __netdev_tx_sent_queue(txq, bytes, more);
> > +	else
> > +		kick = !more || netif_xmit_stopped(txq);
> > +
> > +	if (kick) {
> >   		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq)) {
> >   			u64_stats_update_begin(&sq->stats.syncp);
> >   			sq->stats.kicks++;
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
From: Michael S. Tsirkin @ 2018-12-26 15:02 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <70978ed8-bf76-693a-0e11-d31b6234af5c@redhat.com>

On Wed, Dec 26, 2018 at 11:57:32AM +0800, Jason Wang wrote:
> 
> On 2018/12/25 下午8:50, Michael S. Tsirkin wrote:
> > On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote:
> > > On 2018/12/25 上午2:10, Michael S. Tsirkin wrote:
> > > > On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
> > > > > On 2018/12/14 下午8:36, Michael S. Tsirkin wrote:
> > > > > > On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
> > > > > > > On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
> > > > > > > > On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
> > > > > > > > > It was noticed that the copy_user() friends that was used to access
> > > > > > > > > virtqueue metdata tends to be very expensive for dataplane
> > > > > > > > > implementation like vhost since it involves lots of software check,
> > > > > > > > > speculation barrier, hardware feature toggling (e.g SMAP). The
> > > > > > > > > extra cost will be more obvious when transferring small packets.
> > > > > > > > > 
> > > > > > > > > This patch tries to eliminate those overhead by pin vq metadata pages
> > > > > > > > > and access them through vmap(). During SET_VRING_ADDR, we will setup
> > > > > > > > > those mappings and memory accessors are modified to use pointers to
> > > > > > > > > access the metadata directly.
> > > > > > > > > 
> > > > > > > > > Note, this was only done when device IOTLB is not enabled. We could
> > > > > > > > > use similar method to optimize it in the future.
> > > > > > > > > 
> > > > > > > > > Tests shows about ~24% improvement on TX PPS when using virtio-user +
> > > > > > > > > vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
> > > > > > > > > 
> > > > > > > > > Before: ~5.0Mpps
> > > > > > > > > After:  ~6.1Mpps
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Jason Wang<jasowang@redhat.com>
> > > > > > > > > ---
> > > > > > > > >      drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > >      drivers/vhost/vhost.h |  11 +++
> > > > > > > > >      2 files changed, 189 insertions(+)
> > > > > > > > > 
> > > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > > > > index bafe39d2e637..1bd24203afb6 100644
> > > > > > > > > --- a/drivers/vhost/vhost.c
> > > > > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > > > > @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
> > > > > > > > >      		vq->indirect = NULL;
> > > > > > > > >      		vq->heads = NULL;
> > > > > > > > >      		vq->dev = dev;
> > > > > > > > > +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
> > > > > > > > > +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
> > > > > > > > > +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
> > > > > > > > >      		mutex_init(&vq->mutex);
> > > > > > > > >      		vhost_vq_reset(dev, vq);
> > > > > > > > >      		if (vq->handle_kick)
> > > > > > > > > @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
> > > > > > > > >      	spin_unlock(&dev->iotlb_lock);
> > > > > > > > >      }
> > > > > > > > > +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
> > > > > > > > > +			   size_t size, int write)
> > > > > > > > > +{
> > > > > > > > > +	struct page **pages;
> > > > > > > > > +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
> > > > > > > > > +	int npinned;
> > > > > > > > > +	void *vaddr;
> > > > > > > > > +
> > > > > > > > > +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
> > > > > > > > > +	if (!pages)
> > > > > > > > > +		return -ENOMEM;
> > > > > > > > > +
> > > > > > > > > +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
> > > > > > > > > +	if (npinned != npages)
> > > > > > > > > +		goto err;
> > > > > > > > > +
> > > > > > > > As I said I have doubts about the whole approach, but this
> > > > > > > > implementation in particular isn't a good idea
> > > > > > > > as it keeps the page around forever.
> > > > > The pages wil be released during set features.
> > > > > 
> > > > > 
> > > > > > > > So no THP, no NUMA rebalancing,
> > > > > For THP, we will probably miss 2 or 4 pages, but does this really matter
> > > > > consider the gain we have?
> > > > We as in vhost? networking isn't the only thing guest does.
> > > > We don't even know if this guest does a lot of networking.
> > > > You don't
> > > > know what else is in this huge page. Can be something very important
> > > > that guest touches all the time.
> > > 
> > > Well, the probability should be very small consider we usually give several
> > > gigabytes to guest. The rest of the pages that doesn't sit in the same
> > > hugepage with metadata can still be merged by THP.  Anyway, I can test the
> > > differences.
> > Thanks!
> > 
> > > > > For NUMA rebalancing, I'm even not quite sure if
> > > > > it can helps for the case of IPC (vhost). It looks to me the worst case it
> > > > > may cause page to be thrash between nodes if vhost and userspace are running
> > > > > in two nodes.
> > > > So again it's a gain for vhost but has a completely unpredictable effect on
> > > > other functionality of the guest.
> > > > 
> > > > That's what bothers me with this approach.
> > > 
> > > So:
> > > 
> > > - The rest of the pages could still be balanced to other nodes, no?
> > > 
> > > - try to balance metadata pages (belongs to co-operate processes) itself is
> > > still questionable
> > I am not sure why. It should be easy enough to force the VCPU and vhost
> > to move (e.g. start them pinned to 1 cpu, then pin them to another one).
> > Clearly sometimes this would be necessary for load balancing reasons.
> 
> 
> Yes, but it looks to me the part of motivation of auto NUMA is to avoid
> manual pinning.

... of memory. Yes.


> 
> > With autonuma after a while (could take seconds but it will happen) the
> > memory will migrate.
> > 
> 
> Yes. As you mentioned during the discuss, I wonder we could do it similarly
> through mmu notifier like APIC access page in commit c24ae0dcd3e ("kvm: x86:
> Unpin and remove kvm_arch->apic_access_page")

That would be a possible approach.

> 
> > 
> > 
> > > > 
> > > > 
> > > > 
> > > > > > > This is the price of all GUP users not only vhost itself.
> > > > > > Yes. GUP is just not a great interface for vhost to use.
> > > > > Zerocopy codes (enabled by defualt) use them for years.
> > > > But only for TX and temporarily. We pin, read, unpin.
> > > 
> > > Probably not. For several reasons that the page will be not be released soon
> > > or held for a very long period of time or even forever.
> > 
> > With zero copy? Well it's pinned until transmit. Takes a while
> > but could be enough for autocopy to work esp since
> > its the packet memory so not reused immediately.
> > 
> > > > Your patch is different
> > > > 
> > > > - it writes into memory and GUP has known issues with file
> > > >     backed memory
> > > 
> > > The ordinary user for vhost is anonymous pages I think?
> > 
> > It's not the most common scenario and not the fastest one
> > (e.g. THP does not work) but file backed is useful sometimes.
> > It would not be nice at all to corrupt guest memory in that case.
> 
> 
> Ok.
> 
> 
> > 
> > > > - it keeps pages pinned forever
> > > > 
> > > > 
> > > > 
> > > > > > > What's more
> > > > > > > important, the goal is not to be left too much behind for other backends
> > > > > > > like DPDK or AF_XDP (all of which are using GUP).
> > > > > > So these guys assume userspace knows what it's doing.
> > > > > > We can't assume that.
> > > > > What kind of assumption do you they have?
> > > > > 
> > > > > 
> > > > > > > > userspace-controlled
> > > > > > > > amount of memory locked up and not accounted for.
> > > > > > > It's pretty easy to add this since the slow path was still kept. If we
> > > > > > > exceeds the limitation, we can switch back to slow path.
> > > > > > > 
> > > > > > > > Don't get me wrong it's a great patch in an ideal world.
> > > > > > > > But then in an ideal world no barriers smap etc are necessary at all.
> > > > > > > Again, this is only for metadata accessing not the data which has been used
> > > > > > > for years for real use cases.
> > > > > > > 
> > > > > > > For SMAP, it makes senses for the address that kernel can not forcast. But
> > > > > > > it's not the case for the vhost metadata since we know the address will be
> > > > > > > accessed very frequently. For speculation barrier, it helps nothing for the
> > > > > > > data path of vhost which is a kthread.
> > > > > > I don't see how a kthread makes any difference. We do have a validation
> > > > > > step which makes some difference.
> > > > > The problem is not kthread but the address of userspace address. The
> > > > > addresses of vq metadata tends to be consistent for a while, and vhost knows
> > > > > they will be frequently. SMAP doesn't help too much in this case.
> > > > > 
> > > > > Thanks.
> > > > It's true for a real life applications but a malicious one
> > > > can call the setup ioctls any number of times. And SMAP is
> > > > all about malcious applications.
> > > 
> > > We don't do this in the path of ioctl, there's no context switch between
> > > userspace and kernel in the worker thread. SMAP is used to prevent kernel
> > > from accessing userspace pages unexpectedly which is not the case for
> > > metadata access.
> > > 
> > > Thanks
> > OK let's forget smap for now.
> 
> 
> Some numbers I measured:
> 
> On an old Sandy bridge machine without SMAP support. Remove speculation
> barrier boost the performance from 4.6Mpps to 5.1Mpps
> 
> On a newer Broadwell machine with SMAP support. Remove speculation barrier
> only gives 2%-5% improvement, disable SMAP completely through Kconfig boost
> 57% performance from 4.8Mpps to 7.5Mpps. (Vmap gives 6Mpps - 6.1Mpps, it
> only bypass SMAP for metadata).
> 
> So it looks like for recent machine, SMAP becomes pain point when the copy
> is short (e.g 64B) for high PPS.
> 
> Thanks

Thanks a lot for looking into this!

So first of all users can just boot with nosmap, right?
What's wrong with that? Yes it's not fine-grained but OTOH
it's easy to understand.

And I guess this confirms that if we are going to worry
about smap enabled, we need to look into packet copies
too, not just meta-data.

Vaguely could see a module option (off by default)
where vhost basically does user_access_begin
when it starts running, then uses unsafe accesses
in vhost and tun and then user_access_end.


> 
> > 
> > > > > > > Packet or AF_XDP benefit from
> > > > > > > accessing metadata directly, we should do it as well.
> > > > > > > 
> > > > > > > Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net V2 4/4] vhost: log dirty page correctly
From: Michael S. Tsirkin @ 2018-12-26 13:46 UTC (permalink / raw)
  To: Jason Wang; +Cc: Jintack Lim, netdev, linux-kernel, kvm, virtualization
In-Reply-To: <2a78e991-1917-256b-4f09-60c228c17979@redhat.com>

On Wed, Dec 26, 2018 at 01:43:26PM +0800, Jason Wang wrote:
> 
> On 2018/12/26 上午12:25, Michael S. Tsirkin wrote:
> > On Tue, Dec 25, 2018 at 05:43:25PM +0800, Jason Wang wrote:
> > > On 2018/12/25 上午1:41, Michael S. Tsirkin wrote:
> > > > On Mon, Dec 24, 2018 at 11:43:31AM +0800, Jason Wang wrote:
> > > > > On 2018/12/14 下午9:20, Michael S. Tsirkin wrote:
> > > > > > On Fri, Dec 14, 2018 at 10:43:03AM +0800, Jason Wang wrote:
> > > > > > > On 2018/12/13 下午10:31, Michael S. Tsirkin wrote:
> > > > > > > > > Just to make sure I understand this. It looks to me we should:
> > > > > > > > > 
> > > > > > > > > - allow passing GIOVA->GPA through UAPI
> > > > > > > > > 
> > > > > > > > > - cache GIOVA->GPA somewhere but still use GIOVA->HVA in device IOTLB for
> > > > > > > > > performance
> > > > > > > > > 
> > > > > > > > > Is this what you suggest?
> > > > > > > > > 
> > > > > > > > > Thanks
> > > > > > > > Not really. We already have GPA->HVA, so I suggested a flag to pass
> > > > > > > > GIOVA->GPA in the IOTLB.
> > > > > > > > 
> > > > > > > > This has advantages for security since a single table needs
> > > > > > > > then to be validated to ensure guest does not corrupt
> > > > > > > > QEMU memory.
> > > > > > > > 
> > > > > > > I wonder how much we can gain through this. Currently, qemu IOMMU gives
> > > > > > > GIOVA->GPA mapping, and qemu vhost code will translate GPA to HVA then pass
> > > > > > > GIOVA->HVA to vhost. It looks no difference to me.
> > > > > > > 
> > > > > > > Thanks
> > > > > > The difference is in security not in performance.  Getting a bad HVA
> > > > > > corrupts QEMU memory and it might be guest controlled. Very risky.
> > > > > How can this be controlled by guest? HVA was generated from qemu ram blocks
> > > > > which is totally under the control of qemu memory core instead of guest.
> > > > > 
> > > > > 
> > > > > Thanks
> > > > It is ultimately under guest influence as guest supplies IOVA->GPA
> > > > translations.  qemu translates GPA->HVA and gives the translated result
> > > > to the kernel.  If it's not buggy and kernel isn't buggy it's all
> > > > fine.
> > > 
> > > If qemu provides buggy GPA->HVA, we can't workaround this. And I don't get
> > > the point why we even want to try this. Buggy qemu code can crash itself in
> > > many ways.
> > > 
> > > 
> > > > But that's the approach that was proven not to work in the 20th century.
> > > > In the 21st century we are trying defence in depth approach.
> > > > 
> > > > My point is that a single code path that is responsible for
> > > > the HVA translations is better than two.
> > > > 
> > > So the difference whether or not use memory table information:
> > > 
> > > Current:
> > > 
> > > 1) SET_MEM_TABLE: GPA->HVA
> > > 
> > > 2) Qemu GIOVA->GPA
> > > 
> > > 3) Qemu GPA->HVA
> > > 
> > > 4) IOTLB_UPDATE: GIOVA->HVA
> > > 
> > > If I understand correctly you want to drop step 3 consider it might be buggy
> > > which is just 19 lines of code in qemu (vhost_memory_region_lookup()). This
> > > will ends up:
> > > 
> > > 1) Do GPA->HVA translation in IOTLB_UPDATE path (I believe we won't want to
> > > do it during device IOTLB lookup).
> > > 
> > > 2) Extra bits to enable this capability.
> > > 
> > > So this looks need more codes in kernel than what qemu did in userspace.  Is
> > > this really worthwhile?
> > > 
> > > Thanks
> > So there are several points I would like to make
> > 
> > 1. At the moment without an iommu it is possible to
> >     change GPA-HVA mappings and everything keeps working
> >     because a change in memory tables flushes the rings.
> 
> 
> Interesting, I don't know this before. But when can this happen?


It doesn't happen with existing qemu. But it seems like a valid
thing to do to remap memory at a different address.


> 
> >     However I don't see the iotlb cache being invalidated
> >     on that path - did I miss it? If it is not there it's
> >     a related minor bug.
> 
> 
> It might have a bug. But a question is consider the case without IOMMU. We
> only update mem table (SET_MEM_TABLE), but not vring address. This looks
> like a bug as well?

I think that without an iommu it can only work without races if backend is
stopped or if the vring isn't in guest memory with ring aliasing).


> 
> > 
> > 2. qemu already has a GPA. Discarding it and re-calculating
> >     when logging is on just seems wrong.
> >     However if you would like to *also* keep the HVA in the iotlb
> >     to avoid doing extra translations, that sounds like a
> >     reasonable optimization.
> 
> 
> Yes, traverse GPA->HVA mapping seems unnecessary.
> 
> 
> > 
> > 3. it also means that the hva->gpa translation only runs
> >     when logging is enabled. That is a rarely excercised
> >     path so any bugs there will not be caught.
> 
> 
> I wonder maybe some kind of unit-test may help here.
> 
> 
> > 
> > So I really would like us long term to move away from
> > hva->gpa translations, keep them for legacy userspace only
> > but I don't really mind how we do it.
> > 
> > How about
> > - a new flag to pass an iotlb with *both* a gpa and hva
> > - for legacy userspace, calculate the gpa on iotlb update
> >    so the device then uses a shared code path
> > 
> > what do you think?
> > 
> > 
> 
> I don't object this idea so I can try, just want to figure out why it was a
> must.
> 
> Thanks

Not a must but I think it's a good interface extension.

-- 
MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net V2 4/4] vhost: log dirty page correctly
From: Jason Wang @ 2018-12-26  5:43 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Jintack Lim, netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181225111716-mutt-send-email-mst@kernel.org>


On 2018/12/26 上午12:25, Michael S. Tsirkin wrote:
> On Tue, Dec 25, 2018 at 05:43:25PM +0800, Jason Wang wrote:
>> On 2018/12/25 上午1:41, Michael S. Tsirkin wrote:
>>> On Mon, Dec 24, 2018 at 11:43:31AM +0800, Jason Wang wrote:
>>>> On 2018/12/14 下午9:20, Michael S. Tsirkin wrote:
>>>>> On Fri, Dec 14, 2018 at 10:43:03AM +0800, Jason Wang wrote:
>>>>>> On 2018/12/13 下午10:31, Michael S. Tsirkin wrote:
>>>>>>>> Just to make sure I understand this. It looks to me we should:
>>>>>>>>
>>>>>>>> - allow passing GIOVA->GPA through UAPI
>>>>>>>>
>>>>>>>> - cache GIOVA->GPA somewhere but still use GIOVA->HVA in device IOTLB for
>>>>>>>> performance
>>>>>>>>
>>>>>>>> Is this what you suggest?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>> Not really. We already have GPA->HVA, so I suggested a flag to pass
>>>>>>> GIOVA->GPA in the IOTLB.
>>>>>>>
>>>>>>> This has advantages for security since a single table needs
>>>>>>> then to be validated to ensure guest does not corrupt
>>>>>>> QEMU memory.
>>>>>>>
>>>>>> I wonder how much we can gain through this. Currently, qemu IOMMU gives
>>>>>> GIOVA->GPA mapping, and qemu vhost code will translate GPA to HVA then pass
>>>>>> GIOVA->HVA to vhost. It looks no difference to me.
>>>>>>
>>>>>> Thanks
>>>>> The difference is in security not in performance.  Getting a bad HVA
>>>>> corrupts QEMU memory and it might be guest controlled. Very risky.
>>>> How can this be controlled by guest? HVA was generated from qemu ram blocks
>>>> which is totally under the control of qemu memory core instead of guest.
>>>>
>>>>
>>>> Thanks
>>> It is ultimately under guest influence as guest supplies IOVA->GPA
>>> translations.  qemu translates GPA->HVA and gives the translated result
>>> to the kernel.  If it's not buggy and kernel isn't buggy it's all
>>> fine.
>>
>> If qemu provides buggy GPA->HVA, we can't workaround this. And I don't get
>> the point why we even want to try this. Buggy qemu code can crash itself in
>> many ways.
>>
>>
>>> But that's the approach that was proven not to work in the 20th century.
>>> In the 21st century we are trying defence in depth approach.
>>>
>>> My point is that a single code path that is responsible for
>>> the HVA translations is better than two.
>>>
>> So the difference whether or not use memory table information:
>>
>> Current:
>>
>> 1) SET_MEM_TABLE: GPA->HVA
>>
>> 2) Qemu GIOVA->GPA
>>
>> 3) Qemu GPA->HVA
>>
>> 4) IOTLB_UPDATE: GIOVA->HVA
>>
>> If I understand correctly you want to drop step 3 consider it might be buggy
>> which is just 19 lines of code in qemu (vhost_memory_region_lookup()). This
>> will ends up:
>>
>> 1) Do GPA->HVA translation in IOTLB_UPDATE path (I believe we won't want to
>> do it during device IOTLB lookup).
>>
>> 2) Extra bits to enable this capability.
>>
>> So this looks need more codes in kernel than what qemu did in userspace.  Is
>> this really worthwhile?
>>
>> Thanks
> So there are several points I would like to make
>
> 1. At the moment without an iommu it is possible to
>     change GPA-HVA mappings and everything keeps working
>     because a change in memory tables flushes the rings.


Interesting, I don't know this before. But when can this happen?


>     However I don't see the iotlb cache being invalidated
>     on that path - did I miss it? If it is not there it's
>     a related minor bug.


It might have a bug. But a question is consider the case without IOMMU. 
We only update mem table (SET_MEM_TABLE), but not vring address. This 
looks like a bug as well?


>
> 2. qemu already has a GPA. Discarding it and re-calculating
>     when logging is on just seems wrong.
>     However if you would like to *also* keep the HVA in the iotlb
>     to avoid doing extra translations, that sounds like a
>     reasonable optimization.


Yes, traverse GPA->HVA mapping seems unnecessary.


>
> 3. it also means that the hva->gpa translation only runs
>     when logging is enabled. That is a rarely excercised
>     path so any bugs there will not be caught.


I wonder maybe some kind of unit-test may help here.


>
> So I really would like us long term to move away from
> hva->gpa translations, keep them for legacy userspace only
> but I don't really mind how we do it.
>
> How about
> - a new flag to pass an iotlb with *both* a gpa and hva
> - for legacy userspace, calculate the gpa on iotlb update
>    so the device then uses a shared code path
>
> what do you think?
>
>

I don't object this idea so I can try, just want to figure out why it 
was a must.

Thanks


_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
From: Jason Wang @ 2018-12-26  3:59 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181225075054-mutt-send-email-mst@kernel.org>


On 2018/12/25 下午8:52, Michael S. Tsirkin wrote:
> On Tue, Dec 25, 2018 at 06:09:19PM +0800, Jason Wang wrote:
>> On 2018/12/25 上午2:12, Michael S. Tsirkin wrote:
>>> On Mon, Dec 24, 2018 at 04:32:39PM +0800, Jason Wang wrote:
>>>> On 2018/12/14 下午8:33, Michael S. Tsirkin wrote:
>>>>> On Fri, Dec 14, 2018 at 11:42:18AM +0800, Jason Wang wrote:
>>>>>> On 2018/12/13 下午11:27, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>>>>>>>> Hi:
>>>>>>>>
>>>>>>>> This series tries to access virtqueue metadata through kernel virtual
>>>>>>>> address instead of copy_user() friends since they had too much
>>>>>>>> overheads like checks, spec barriers or even hardware feature
>>>>>>>> toggling.
>>>>>>> Userspace accesses through remapping tricks and next time there's a need
>>>>>>> for a new barrier we are left to figure it out by ourselves.
>>>>>> I don't get here, do you mean spec barriers?
>>>>> I mean the next barrier people decide to put into userspace
>>>>> memory accesses.
>>>>>
>>>>>> It's completely unnecessary for
>>>>>> vhost which is kernel thread.
>>>>> It's defence in depth. Take a look at the commit that added them.
>>>>> And yes quite possibly in most cases we actually have a spec
>>>>> barrier in the validation phase. If we do let's use the
>>>>> unsafe variants so they can be found.
>>>> unsafe variants can only work if you can batch userspace access. This is not
>>>> necessarily the case for light load.
>>> Do we care a lot about the light load? How would you benchmark it?
>>>
>> If we can reduce the latency that's will be more than what we expect.
>>
>> 1 byte TCP_RR shows 1.5%-2% improvement.
> It's nice but not great. E.g. adaptive polling would be
> a better approach to work on latency imho.


Actually this is another advantages of vmap():

For split ring, we will poll avail idx

For packed ring, we will poll wrap counter

Either of which can not be batched.


>
>>>>>> And even if you're right, vhost is not the
>>>>>> only place, there's lots of vmap() based accessing in kernel.
>>>>> For sure. But if one can get by without get user pages, one
>>>>> really should. Witness recently uncovered mess with file
>>>>> backed storage.
>>>> We only pin metadata pages, I don't believe they will be used by any DMA.
>>> It doesn't matter really, if you dirty pages behind the MM back
>>> the problem is there.
>>
>> Ok, but the usual case is anonymous pages, do we use file backed pages for
>> user of vhost?
> Some people use file backed pages for vms.
> Nothing prevents them from using vhost as well.


Ok.


>
>> And even if we use sometime, according to the pointer it's
>> not something that can fix, RFC has been posted to solve this issue.
>>
>> Thanks
> Except it's not broken if we don't to gup + write.
> So yea, wait for rfc to be merged.
>

Yes.

Thanks

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
From: Jason Wang @ 2018-12-26  3:57 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181225071501-mutt-send-email-mst@kernel.org>


On 2018/12/25 下午8:50, Michael S. Tsirkin wrote:
> On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote:
>> On 2018/12/25 上午2:10, Michael S. Tsirkin wrote:
>>> On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
>>>> On 2018/12/14 下午8:36, Michael S. Tsirkin wrote:
>>>>> On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
>>>>>> On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
>>>>>>>> It was noticed that the copy_user() friends that was used to access
>>>>>>>> virtqueue metdata tends to be very expensive for dataplane
>>>>>>>> implementation like vhost since it involves lots of software check,
>>>>>>>> speculation barrier, hardware feature toggling (e.g SMAP). The
>>>>>>>> extra cost will be more obvious when transferring small packets.
>>>>>>>>
>>>>>>>> This patch tries to eliminate those overhead by pin vq metadata pages
>>>>>>>> and access them through vmap(). During SET_VRING_ADDR, we will setup
>>>>>>>> those mappings and memory accessors are modified to use pointers to
>>>>>>>> access the metadata directly.
>>>>>>>>
>>>>>>>> Note, this was only done when device IOTLB is not enabled. We could
>>>>>>>> use similar method to optimize it in the future.
>>>>>>>>
>>>>>>>> Tests shows about ~24% improvement on TX PPS when using virtio-user +
>>>>>>>> vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
>>>>>>>>
>>>>>>>> Before: ~5.0Mpps
>>>>>>>> After:  ~6.1Mpps
>>>>>>>>
>>>>>>>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>>>>>>>> ---
>>>>>>>>      drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>      drivers/vhost/vhost.h |  11 +++
>>>>>>>>      2 files changed, 189 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>>>>>> index bafe39d2e637..1bd24203afb6 100644
>>>>>>>> --- a/drivers/vhost/vhost.c
>>>>>>>> +++ b/drivers/vhost/vhost.c
>>>>>>>> @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>>>>>>>>      		vq->indirect = NULL;
>>>>>>>>      		vq->heads = NULL;
>>>>>>>>      		vq->dev = dev;
>>>>>>>> +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
>>>>>>>> +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
>>>>>>>> +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>>>>>>>>      		mutex_init(&vq->mutex);
>>>>>>>>      		vhost_vq_reset(dev, vq);
>>>>>>>>      		if (vq->handle_kick)
>>>>>>>> @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>>>>>>>>      	spin_unlock(&dev->iotlb_lock);
>>>>>>>>      }
>>>>>>>> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
>>>>>>>> +			   size_t size, int write)
>>>>>>>> +{
>>>>>>>> +	struct page **pages;
>>>>>>>> +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
>>>>>>>> +	int npinned;
>>>>>>>> +	void *vaddr;
>>>>>>>> +
>>>>>>>> +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
>>>>>>>> +	if (!pages)
>>>>>>>> +		return -ENOMEM;
>>>>>>>> +
>>>>>>>> +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
>>>>>>>> +	if (npinned != npages)
>>>>>>>> +		goto err;
>>>>>>>> +
>>>>>>> As I said I have doubts about the whole approach, but this
>>>>>>> implementation in particular isn't a good idea
>>>>>>> as it keeps the page around forever.
>>>> The pages wil be released during set features.
>>>>
>>>>
>>>>>>> So no THP, no NUMA rebalancing,
>>>> For THP, we will probably miss 2 or 4 pages, but does this really matter
>>>> consider the gain we have?
>>> We as in vhost? networking isn't the only thing guest does.
>>> We don't even know if this guest does a lot of networking.
>>> You don't
>>> know what else is in this huge page. Can be something very important
>>> that guest touches all the time.
>>
>> Well, the probability should be very small consider we usually give several
>> gigabytes to guest. The rest of the pages that doesn't sit in the same
>> hugepage with metadata can still be merged by THP.  Anyway, I can test the
>> differences.
> Thanks!
>
>>>> For NUMA rebalancing, I'm even not quite sure if
>>>> it can helps for the case of IPC (vhost). It looks to me the worst case it
>>>> may cause page to be thrash between nodes if vhost and userspace are running
>>>> in two nodes.
>>> So again it's a gain for vhost but has a completely unpredictable effect on
>>> other functionality of the guest.
>>>
>>> That's what bothers me with this approach.
>>
>> So:
>>
>> - The rest of the pages could still be balanced to other nodes, no?
>>
>> - try to balance metadata pages (belongs to co-operate processes) itself is
>> still questionable
> I am not sure why. It should be easy enough to force the VCPU and vhost
> to move (e.g. start them pinned to 1 cpu, then pin them to another one).
> Clearly sometimes this would be necessary for load balancing reasons.


Yes, but it looks to me the part of motivation of auto NUMA is to avoid 
manual pinning.


> With autonuma after a while (could take seconds but it will happen) the
> memory will migrate.
>

Yes. As you mentioned during the discuss, I wonder we could do it 
similarly through mmu notifier like APIC access page in commit 
c24ae0dcd3e ("kvm: x86: Unpin and remove kvm_arch->apic_access_page")


>
>
>>>
>>>
>>>
>>>>>> This is the price of all GUP users not only vhost itself.
>>>>> Yes. GUP is just not a great interface for vhost to use.
>>>> Zerocopy codes (enabled by defualt) use them for years.
>>> But only for TX and temporarily. We pin, read, unpin.
>>
>> Probably not. For several reasons that the page will be not be released soon
>> or held for a very long period of time or even forever.
>
> With zero copy? Well it's pinned until transmit. Takes a while
> but could be enough for autocopy to work esp since
> its the packet memory so not reused immediately.
>
>>> Your patch is different
>>>
>>> - it writes into memory and GUP has known issues with file
>>>     backed memory
>>
>> The ordinary user for vhost is anonymous pages I think?
>
> It's not the most common scenario and not the fastest one
> (e.g. THP does not work) but file backed is useful sometimes.
> It would not be nice at all to corrupt guest memory in that case.


Ok.


>
>>> - it keeps pages pinned forever
>>>
>>>
>>>
>>>>>> What's more
>>>>>> important, the goal is not to be left too much behind for other backends
>>>>>> like DPDK or AF_XDP (all of which are using GUP).
>>>>> So these guys assume userspace knows what it's doing.
>>>>> We can't assume that.
>>>> What kind of assumption do you they have?
>>>>
>>>>
>>>>>>> userspace-controlled
>>>>>>> amount of memory locked up and not accounted for.
>>>>>> It's pretty easy to add this since the slow path was still kept. If we
>>>>>> exceeds the limitation, we can switch back to slow path.
>>>>>>
>>>>>>> Don't get me wrong it's a great patch in an ideal world.
>>>>>>> But then in an ideal world no barriers smap etc are necessary at all.
>>>>>> Again, this is only for metadata accessing not the data which has been used
>>>>>> for years for real use cases.
>>>>>>
>>>>>> For SMAP, it makes senses for the address that kernel can not forcast. But
>>>>>> it's not the case for the vhost metadata since we know the address will be
>>>>>> accessed very frequently. For speculation barrier, it helps nothing for the
>>>>>> data path of vhost which is a kthread.
>>>>> I don't see how a kthread makes any difference. We do have a validation
>>>>> step which makes some difference.
>>>> The problem is not kthread but the address of userspace address. The
>>>> addresses of vq metadata tends to be consistent for a while, and vhost knows
>>>> they will be frequently. SMAP doesn't help too much in this case.
>>>>
>>>> Thanks.
>>> It's true for a real life applications but a malicious one
>>> can call the setup ioctls any number of times. And SMAP is
>>> all about malcious applications.
>>
>> We don't do this in the path of ioctl, there's no context switch between
>> userspace and kernel in the worker thread. SMAP is used to prevent kernel
>> from accessing userspace pages unexpectedly which is not the case for
>> metadata access.
>>
>> Thanks
> OK let's forget smap for now.


Some numbers I measured:

On an old Sandy bridge machine without SMAP support. Remove speculation 
barrier boost the performance from 4.6Mpps to 5.1Mpps

On a newer Broadwell machine with SMAP support. Remove speculation 
barrier only gives 2%-5% improvement, disable SMAP completely through 
Kconfig boost 57% performance from 4.8Mpps to 7.5Mpps. (Vmap gives 6Mpps 
- 6.1Mpps, it only bypass SMAP for metadata).

So it looks like for recent machine, SMAP becomes pain point when the 
copy is short (e.g 64B) for high PPS.

Thanks


>
>>>>>> Packet or AF_XDP benefit from
>>>>>> accessing metadata directly, we should do it as well.
>>>>>>
>>>>>> Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* CISTI'2019 - Doctoral Symposium | Coimbra, Portugal
From: Maria Lemos @ 2018-12-25 18:02 UTC (permalink / raw)
  To: virtualization


[-- Attachment #1.1: Type: text/plain, Size: 5343 bytes --]

* Proceedings published in IEEE Xplore and indexed by ISI, Scopus, etc.


---------------------------------------------------------------------------------------------------------------------------
Doctoral Symposium of CISTI'2019 - 14th Iberian Conference on Information Systems and Technologies
                                                   Coimbra, Portugal, 19 - 22 June 2019
                                                               http://www.cisti.eu/ <http://www.cisti.eu/>
------------------------------------------------------------------------------------------------------------------------------------



The purpose of CISTI'2019’s Doctoral Symposium is to provide graduate students a setting where they can, informally, expose and discuss their work, collecting valuable expert opinions and sharing new ideas, methods and applications. The Doctoral Symposium is an excellent opportunity for PhD students to present and discuss their work in a Workshop format. Each presentation will be evaluated by a panel composed by at least three Information Systems and Technologies experts.



Contributions Submission

The Doctoral Symposium is opened to PhD students whose research area includes the themes proposed for this Conference. Submissions must include an extended abstract (maximum 4 pages), following the Conference style guide <http://cisti.eu/2017/images/templates.zip>. All selected contributions will be handed out along with the Conference Proceedings, in CD with an ISBN. These contributions will be available in the IEEE Xplore <https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=8390719> Digital Library and will be sent for indexing in ISI, Scopus, EI-Compendex, INSPEC and Google Scholar.

Submissions must include the field, the PhD institution and the number of months devoted to the development of the work. Additionally, they should include in a clear and succinct manner:

    •    The problem approached and its significance or relevance
    •    The research objectives and related investigation topics
    •    A brief display of what is already known
    •    A proposed solution methodology for the problem
    •    Expected results



Important Dates

Paper submission: February 10, 2019

Notification of acceptance: March 17, 2019

Submission of accepted papers: March 31, 2019

Payment of registration, to ensure the inclusion of an accepted paper in the conference proceedings: April 1, 2019



Organizing Committee

Álvaro Rocha, Universidade de Coimbra

Manuel Pérez Cota, Universidad de Vigo



Scientific Committee

Manuel Pérez Cota, Universidad de Vigo (Chair)

A. Augusto Sousa, FEUP, Universidade do Porto

Adolfo Lozano Tello, Universidad de Extremadura

Alma María Gómez Rodríguez, Universidade de Vigo

Álvaro Rocha, Universidade de Coimbra

Ana Amélia Carvalho, Universidade de Coimbra

Ana Maria Ramalho Correia, NOVA Information Management School

António Coelho, FEUP, Universidade do Porto

Antonio Garcia-Loureiro, Universidad de Santiago de Compostela

Arnaldo Martins, Universidade de Aveiro

Arturo Méndez Penín, Universidade de Vigo

Bráulio Alturas, ISCTE - Insituto Universitário de Lisboa

Carlos Costa, ISEG, Universidade de Lisboa

Carlos Ferrás Sexto, Universidad de Santiago de Compostela

David Fonseca, La Salle, Universitat Ramon Llull

Ernest Redondo, Universidad Politécnica de Catalunya

Fernando Moreira, Universidade Portucalense

Fernando Ramos, Universidade de Aveiro

Francisco Restivo, Universidade Católica Portuguesa

Gonçalo Paiva Dias, Universidade de Aveiro

Gonzalo Cuevas Agustin, Universidad Politécnica de Madrid

Guilhermina Maria Lobato de Miranda, IE, Universidade de Lisboa

João Costa, Universidade de Coimbra

João Manuel R.S. Tavares, FEUP, Universidade do Porto

José Antonio Calvo-Manzano Villalón, Universidad Politécnica de Madrid

José Borbinha, IST, Universidade de Lisboa

José Machado, Universidade do Minho

José Martins, Universidade de Trás-os-Montes e Alto Douro

Juan de Dios Murillo, Universidad Nacional de Costa Rica

Leandro Rodríguez Linares, Universidade de Vigo

Luciano Boquete, Universidad de Alcalá

Luis Camarinha Matos, Universidade Nova de Lisboa

Luis Macedo, Universidade de Coimbra

Luís Paulo Reis, FEUP, Universidade do Porto

Marco Painho, NOVA Information Management School

Mareca López María Pilar, Universidad Politécnica de Madrid

María José Lado Touriño, Universidade de Vigo

Mário Piattini, Universidad de Castilla-La Mancha

Mário Rela, Universidade de Coimbra

Martin Llamas-Nistal, Universidad de Vigo

Miguel Ramón González Castro, Ence, Energía y Celulosa

Nelson Rocha, Universidade de Aveiro

Paulo Pinto, Univesidade Nova de Lisboa

Óscar Mealha, Universidade de Aveiro

Ramiro Gonçalves, Universidade de Trás-os-Montes e Alto Douro

Vitor Santos, NOVA Information Management School

Yolanda García Vázquez, Universidad de Santiago de Compostela




Doctoral Symposium webpage: https://goo.gl/JTrcLB


Kind regards,

CISTI'2019 Team
http://www.cisti.eu/ <http://www.cisti.eu/>






---
This email has been checked for viruses by AVG.
https://www.avg.com

[-- Attachment #1.2: Type: text/html, Size: 9298 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net V2 4/4] vhost: log dirty page correctly
From: Michael S. Tsirkin @ 2018-12-25 16:25 UTC (permalink / raw)
  To: Jason Wang; +Cc: Jintack Lim, netdev, linux-kernel, kvm, virtualization
In-Reply-To: <9e57732f-2d42-173f-9297-42821f34ab8f@redhat.com>

On Tue, Dec 25, 2018 at 05:43:25PM +0800, Jason Wang wrote:
> 
> On 2018/12/25 上午1:41, Michael S. Tsirkin wrote:
> > On Mon, Dec 24, 2018 at 11:43:31AM +0800, Jason Wang wrote:
> > > On 2018/12/14 下午9:20, Michael S. Tsirkin wrote:
> > > > On Fri, Dec 14, 2018 at 10:43:03AM +0800, Jason Wang wrote:
> > > > > On 2018/12/13 下午10:31, Michael S. Tsirkin wrote:
> > > > > > > Just to make sure I understand this. It looks to me we should:
> > > > > > > 
> > > > > > > - allow passing GIOVA->GPA through UAPI
> > > > > > > 
> > > > > > > - cache GIOVA->GPA somewhere but still use GIOVA->HVA in device IOTLB for
> > > > > > > performance
> > > > > > > 
> > > > > > > Is this what you suggest?
> > > > > > > 
> > > > > > > Thanks
> > > > > > Not really. We already have GPA->HVA, so I suggested a flag to pass
> > > > > > GIOVA->GPA in the IOTLB.
> > > > > > 
> > > > > > This has advantages for security since a single table needs
> > > > > > then to be validated to ensure guest does not corrupt
> > > > > > QEMU memory.
> > > > > > 
> > > > > I wonder how much we can gain through this. Currently, qemu IOMMU gives
> > > > > GIOVA->GPA mapping, and qemu vhost code will translate GPA to HVA then pass
> > > > > GIOVA->HVA to vhost. It looks no difference to me.
> > > > > 
> > > > > Thanks
> > > > The difference is in security not in performance.  Getting a bad HVA
> > > > corrupts QEMU memory and it might be guest controlled. Very risky.
> > > How can this be controlled by guest? HVA was generated from qemu ram blocks
> > > which is totally under the control of qemu memory core instead of guest.
> > > 
> > > 
> > > Thanks
> > It is ultimately under guest influence as guest supplies IOVA->GPA
> > translations.  qemu translates GPA->HVA and gives the translated result
> > to the kernel.  If it's not buggy and kernel isn't buggy it's all
> > fine.
> 
> 
> If qemu provides buggy GPA->HVA, we can't workaround this. And I don't get
> the point why we even want to try this. Buggy qemu code can crash itself in
> many ways.
> 
> 
> > 
> > But that's the approach that was proven not to work in the 20th century.
> > In the 21st century we are trying defence in depth approach.
> > 
> > My point is that a single code path that is responsible for
> > the HVA translations is better than two.
> > 
> 
> So the difference whether or not use memory table information:
> 
> Current:
> 
> 1) SET_MEM_TABLE: GPA->HVA
> 
> 2) Qemu GIOVA->GPA
> 
> 3) Qemu GPA->HVA
> 
> 4) IOTLB_UPDATE: GIOVA->HVA
> 
> If I understand correctly you want to drop step 3 consider it might be buggy
> which is just 19 lines of code in qemu (vhost_memory_region_lookup()). This
> will ends up:
> 
> 1) Do GPA->HVA translation in IOTLB_UPDATE path (I believe we won't want to
> do it during device IOTLB lookup).
> 
> 2) Extra bits to enable this capability.
> 
> So this looks need more codes in kernel than what qemu did in userspace.  Is
> this really worthwhile?
> 
> Thanks

So there are several points I would like to make

1. At the moment without an iommu it is possible to
   change GPA-HVA mappings and everything keeps working
   because a change in memory tables flushes the rings.
   However I don't see the iotlb cache being invalidated
   on that path - did I miss it? If it is not there it's
   a related minor bug.

2. qemu already has a GPA. Discarding it and re-calculating
   when logging is on just seems wrong.
   However if you would like to *also* keep the HVA in the iotlb
   to avoid doing extra translations, that sounds like a
   reasonable optimization.

3. it also means that the hva->gpa translation only runs
   when logging is enabled. That is a rarely excercised
   path so any bugs there will not be caught.

So I really would like us long term to move away from
hva->gpa translations, keep them for legacy userspace only
but I don't really mind how we do it.

How about
- a new flag to pass an iotlb with *both* a gpa and hva
- for legacy userspace, calculate the gpa on iotlb update
  so the device then uses a shared code path

what do you think?


-- 
MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
From: Michael S. Tsirkin @ 2018-12-25 12:52 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <51fa034d-99ae-3820-c3a4-d9e6f2eefe34@redhat.com>

On Tue, Dec 25, 2018 at 06:09:19PM +0800, Jason Wang wrote:
> 
> On 2018/12/25 上午2:12, Michael S. Tsirkin wrote:
> > On Mon, Dec 24, 2018 at 04:32:39PM +0800, Jason Wang wrote:
> > > On 2018/12/14 下午8:33, Michael S. Tsirkin wrote:
> > > > On Fri, Dec 14, 2018 at 11:42:18AM +0800, Jason Wang wrote:
> > > > > On 2018/12/13 下午11:27, Michael S. Tsirkin wrote:
> > > > > > On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> > > > > > > Hi:
> > > > > > > 
> > > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > > address instead of copy_user() friends since they had too much
> > > > > > > overheads like checks, spec barriers or even hardware feature
> > > > > > > toggling.
> > > > > > Userspace accesses through remapping tricks and next time there's a need
> > > > > > for a new barrier we are left to figure it out by ourselves.
> > > > > I don't get here, do you mean spec barriers?
> > > > I mean the next barrier people decide to put into userspace
> > > > memory accesses.
> > > > 
> > > > > It's completely unnecessary for
> > > > > vhost which is kernel thread.
> > > > It's defence in depth. Take a look at the commit that added them.
> > > > And yes quite possibly in most cases we actually have a spec
> > > > barrier in the validation phase. If we do let's use the
> > > > unsafe variants so they can be found.
> > > 
> > > unsafe variants can only work if you can batch userspace access. This is not
> > > necessarily the case for light load.
> > 
> > Do we care a lot about the light load? How would you benchmark it?
> > 
> 
> If we can reduce the latency that's will be more than what we expect.
> 
> 1 byte TCP_RR shows 1.5%-2% improvement.

It's nice but not great. E.g. adaptive polling would be
a better approach to work on latency imho.

> 
> > > > > And even if you're right, vhost is not the
> > > > > only place, there's lots of vmap() based accessing in kernel.
> > > > For sure. But if one can get by without get user pages, one
> > > > really should. Witness recently uncovered mess with file
> > > > backed storage.
> > > 
> > > We only pin metadata pages, I don't believe they will be used by any DMA.
> > It doesn't matter really, if you dirty pages behind the MM back
> > the problem is there.
> 
> 
> Ok, but the usual case is anonymous pages, do we use file backed pages for
> user of vhost?

Some people use file backed pages for vms.
Nothing prevents them from using vhost as well.

> And even if we use sometime, according to the pointer it's
> not something that can fix, RFC has been posted to solve this issue.
> 
> Thanks

Except it's not broken if we don't to gup + write.
So yea, wait for rfc to be merged.

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
From: Michael S. Tsirkin @ 2018-12-25 12:50 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <b31dbf4c-d499-1d34-f39f-f10cc0febc84@redhat.com>

On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote:
> 
> On 2018/12/25 上午2:10, Michael S. Tsirkin wrote:
> > On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
> > > On 2018/12/14 下午8:36, Michael S. Tsirkin wrote:
> > > > On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
> > > > > On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
> > > > > > On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
> > > > > > > It was noticed that the copy_user() friends that was used to access
> > > > > > > virtqueue metdata tends to be very expensive for dataplane
> > > > > > > implementation like vhost since it involves lots of software check,
> > > > > > > speculation barrier, hardware feature toggling (e.g SMAP). The
> > > > > > > extra cost will be more obvious when transferring small packets.
> > > > > > > 
> > > > > > > This patch tries to eliminate those overhead by pin vq metadata pages
> > > > > > > and access them through vmap(). During SET_VRING_ADDR, we will setup
> > > > > > > those mappings and memory accessors are modified to use pointers to
> > > > > > > access the metadata directly.
> > > > > > > 
> > > > > > > Note, this was only done when device IOTLB is not enabled. We could
> > > > > > > use similar method to optimize it in the future.
> > > > > > > 
> > > > > > > Tests shows about ~24% improvement on TX PPS when using virtio-user +
> > > > > > > vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
> > > > > > > 
> > > > > > > Before: ~5.0Mpps
> > > > > > > After:  ~6.1Mpps
> > > > > > > 
> > > > > > > Signed-off-by: Jason Wang<jasowang@redhat.com>
> > > > > > > ---
> > > > > > >     drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
> > > > > > >     drivers/vhost/vhost.h |  11 +++
> > > > > > >     2 files changed, 189 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > > index bafe39d2e637..1bd24203afb6 100644
> > > > > > > --- a/drivers/vhost/vhost.c
> > > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > > @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
> > > > > > >     		vq->indirect = NULL;
> > > > > > >     		vq->heads = NULL;
> > > > > > >     		vq->dev = dev;
> > > > > > > +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
> > > > > > > +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
> > > > > > > +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
> > > > > > >     		mutex_init(&vq->mutex);
> > > > > > >     		vhost_vq_reset(dev, vq);
> > > > > > >     		if (vq->handle_kick)
> > > > > > > @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
> > > > > > >     	spin_unlock(&dev->iotlb_lock);
> > > > > > >     }
> > > > > > > +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
> > > > > > > +			   size_t size, int write)
> > > > > > > +{
> > > > > > > +	struct page **pages;
> > > > > > > +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
> > > > > > > +	int npinned;
> > > > > > > +	void *vaddr;
> > > > > > > +
> > > > > > > +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
> > > > > > > +	if (!pages)
> > > > > > > +		return -ENOMEM;
> > > > > > > +
> > > > > > > +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
> > > > > > > +	if (npinned != npages)
> > > > > > > +		goto err;
> > > > > > > +
> > > > > > As I said I have doubts about the whole approach, but this
> > > > > > implementation in particular isn't a good idea
> > > > > > as it keeps the page around forever.
> > > 
> > > The pages wil be released during set features.
> > > 
> > > 
> > > > > > So no THP, no NUMA rebalancing,
> > > 
> > > For THP, we will probably miss 2 or 4 pages, but does this really matter
> > > consider the gain we have?
> > We as in vhost? networking isn't the only thing guest does.
> > We don't even know if this guest does a lot of networking.
> > You don't
> > know what else is in this huge page. Can be something very important
> > that guest touches all the time.
> 
> 
> Well, the probability should be very small consider we usually give several
> gigabytes to guest. The rest of the pages that doesn't sit in the same
> hugepage with metadata can still be merged by THP.  Anyway, I can test the
> differences.

Thanks!

> 
> > 
> > > For NUMA rebalancing, I'm even not quite sure if
> > > it can helps for the case of IPC (vhost). It looks to me the worst case it
> > > may cause page to be thrash between nodes if vhost and userspace are running
> > > in two nodes.
> > 
> > So again it's a gain for vhost but has a completely unpredictable effect on
> > other functionality of the guest.
> > 
> > That's what bothers me with this approach.
> 
> 
> So:
> 
> - The rest of the pages could still be balanced to other nodes, no?
> 
> - try to balance metadata pages (belongs to co-operate processes) itself is
> still questionable

I am not sure why. It should be easy enough to force the VCPU and vhost
to move (e.g. start them pinned to 1 cpu, then pin them to another one).
Clearly sometimes this would be necessary for load balancing reasons.
With autonuma after a while (could take seconds but it will happen) the
memory will migrate.




> 
> > 
> > 
> > 
> > 
> > > > > This is the price of all GUP users not only vhost itself.
> > > > Yes. GUP is just not a great interface for vhost to use.
> > > 
> > > Zerocopy codes (enabled by defualt) use them for years.
> > But only for TX and temporarily. We pin, read, unpin.
> 
> 
> Probably not. For several reasons that the page will be not be released soon
> or held for a very long period of time or even forever.


With zero copy? Well it's pinned until transmit. Takes a while
but could be enough for autocopy to work esp since
its the packet memory so not reused immediately.

> 
> > 
> > Your patch is different
> > 
> > - it writes into memory and GUP has known issues with file
> >    backed memory
> 
> 
> The ordinary user for vhost is anonymous pages I think?


It's not the most common scenario and not the fastest one
(e.g. THP does not work) but file backed is useful sometimes.
It would not be nice at all to corrupt guest memory in that case.


> 
> > - it keeps pages pinned forever
> > 
> > 
> > 
> > > > > What's more
> > > > > important, the goal is not to be left too much behind for other backends
> > > > > like DPDK or AF_XDP (all of which are using GUP).
> > > > So these guys assume userspace knows what it's doing.
> > > > We can't assume that.
> > > 
> > > What kind of assumption do you they have?
> > > 
> > > 
> > > > > > userspace-controlled
> > > > > > amount of memory locked up and not accounted for.
> > > > > It's pretty easy to add this since the slow path was still kept. If we
> > > > > exceeds the limitation, we can switch back to slow path.
> > > > > 
> > > > > > Don't get me wrong it's a great patch in an ideal world.
> > > > > > But then in an ideal world no barriers smap etc are necessary at all.
> > > > > Again, this is only for metadata accessing not the data which has been used
> > > > > for years for real use cases.
> > > > > 
> > > > > For SMAP, it makes senses for the address that kernel can not forcast. But
> > > > > it's not the case for the vhost metadata since we know the address will be
> > > > > accessed very frequently. For speculation barrier, it helps nothing for the
> > > > > data path of vhost which is a kthread.
> > > > I don't see how a kthread makes any difference. We do have a validation
> > > > step which makes some difference.
> > > 
> > > The problem is not kthread but the address of userspace address. The
> > > addresses of vq metadata tends to be consistent for a while, and vhost knows
> > > they will be frequently. SMAP doesn't help too much in this case.
> > > 
> > > Thanks.
> > It's true for a real life applications but a malicious one
> > can call the setup ioctls any number of times. And SMAP is
> > all about malcious applications.
> 
> 
> We don't do this in the path of ioctl, there's no context switch between
> userspace and kernel in the worker thread. SMAP is used to prevent kernel
> from accessing userspace pages unexpectedly which is not the case for
> metadata access.
> 
> Thanks

OK let's forget smap for now.

> 
> > 
> > > > > Packet or AF_XDP benefit from
> > > > > accessing metadata directly, we should do it as well.
> > > > > 
> > > > > Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
From: Jason Wang @ 2018-12-25 10:09 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181224131040-mutt-send-email-mst@kernel.org>


On 2018/12/25 上午2:12, Michael S. Tsirkin wrote:
> On Mon, Dec 24, 2018 at 04:32:39PM +0800, Jason Wang wrote:
>> On 2018/12/14 下午8:33, Michael S. Tsirkin wrote:
>>> On Fri, Dec 14, 2018 at 11:42:18AM +0800, Jason Wang wrote:
>>>> On 2018/12/13 下午11:27, Michael S. Tsirkin wrote:
>>>>> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>>>>>> Hi:
>>>>>>
>>>>>> This series tries to access virtqueue metadata through kernel virtual
>>>>>> address instead of copy_user() friends since they had too much
>>>>>> overheads like checks, spec barriers or even hardware feature
>>>>>> toggling.
>>>>> Userspace accesses through remapping tricks and next time there's a need
>>>>> for a new barrier we are left to figure it out by ourselves.
>>>> I don't get here, do you mean spec barriers?
>>> I mean the next barrier people decide to put into userspace
>>> memory accesses.
>>>
>>>> It's completely unnecessary for
>>>> vhost which is kernel thread.
>>> It's defence in depth. Take a look at the commit that added them.
>>> And yes quite possibly in most cases we actually have a spec
>>> barrier in the validation phase. If we do let's use the
>>> unsafe variants so they can be found.
>>
>> unsafe variants can only work if you can batch userspace access. This is not
>> necessarily the case for light load.
>
> Do we care a lot about the light load? How would you benchmark it?
>

If we can reduce the latency that's will be more than what we expect.

1 byte TCP_RR shows 1.5%-2% improvement.


>>>> And even if you're right, vhost is not the
>>>> only place, there's lots of vmap() based accessing in kernel.
>>> For sure. But if one can get by without get user pages, one
>>> really should. Witness recently uncovered mess with file
>>> backed storage.
>>
>> We only pin metadata pages, I don't believe they will be used by any DMA.
> It doesn't matter really, if you dirty pages behind the MM back
> the problem is there.


Ok, but the usual case is anonymous pages, do we use file backed pages 
for user of vhost? And even if we use sometime, according to the pointer 
it's not something that can fix, RFC has been posted to solve this issue.

Thanks


_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
From: Jason Wang @ 2018-12-25 10:05 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181224125237-mutt-send-email-mst@kernel.org>


On 2018/12/25 上午2:10, Michael S. Tsirkin wrote:
> On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
>> On 2018/12/14 下午8:36, Michael S. Tsirkin wrote:
>>> On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
>>>> On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
>>>>> On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
>>>>>> It was noticed that the copy_user() friends that was used to access
>>>>>> virtqueue metdata tends to be very expensive for dataplane
>>>>>> implementation like vhost since it involves lots of software check,
>>>>>> speculation barrier, hardware feature toggling (e.g SMAP). The
>>>>>> extra cost will be more obvious when transferring small packets.
>>>>>>
>>>>>> This patch tries to eliminate those overhead by pin vq metadata pages
>>>>>> and access them through vmap(). During SET_VRING_ADDR, we will setup
>>>>>> those mappings and memory accessors are modified to use pointers to
>>>>>> access the metadata directly.
>>>>>>
>>>>>> Note, this was only done when device IOTLB is not enabled. We could
>>>>>> use similar method to optimize it in the future.
>>>>>>
>>>>>> Tests shows about ~24% improvement on TX PPS when using virtio-user +
>>>>>> vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
>>>>>>
>>>>>> Before: ~5.0Mpps
>>>>>> After:  ~6.1Mpps
>>>>>>
>>>>>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>>>>>> ---
>>>>>>     drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>     drivers/vhost/vhost.h |  11 +++
>>>>>>     2 files changed, 189 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>>>> index bafe39d2e637..1bd24203afb6 100644
>>>>>> --- a/drivers/vhost/vhost.c
>>>>>> +++ b/drivers/vhost/vhost.c
>>>>>> @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>>>>>>     		vq->indirect = NULL;
>>>>>>     		vq->heads = NULL;
>>>>>>     		vq->dev = dev;
>>>>>> +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
>>>>>> +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
>>>>>> +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>>>>>>     		mutex_init(&vq->mutex);
>>>>>>     		vhost_vq_reset(dev, vq);
>>>>>>     		if (vq->handle_kick)
>>>>>> @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>>>>>>     	spin_unlock(&dev->iotlb_lock);
>>>>>>     }
>>>>>> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
>>>>>> +			   size_t size, int write)
>>>>>> +{
>>>>>> +	struct page **pages;
>>>>>> +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
>>>>>> +	int npinned;
>>>>>> +	void *vaddr;
>>>>>> +
>>>>>> +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
>>>>>> +	if (!pages)
>>>>>> +		return -ENOMEM;
>>>>>> +
>>>>>> +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
>>>>>> +	if (npinned != npages)
>>>>>> +		goto err;
>>>>>> +
>>>>> As I said I have doubts about the whole approach, but this
>>>>> implementation in particular isn't a good idea
>>>>> as it keeps the page around forever.
>>
>> The pages wil be released during set features.
>>
>>
>>>>> So no THP, no NUMA rebalancing,
>>
>> For THP, we will probably miss 2 or 4 pages, but does this really matter
>> consider the gain we have?
> We as in vhost? networking isn't the only thing guest does.
> We don't even know if this guest does a lot of networking.
> You don't
> know what else is in this huge page. Can be something very important
> that guest touches all the time.


Well, the probability should be very small consider we usually give 
several gigabytes to guest. The rest of the pages that doesn't sit in 
the same hugepage with metadata can still be merged by THP.  Anyway, I 
can test the differences.


>
>> For NUMA rebalancing, I'm even not quite sure if
>> it can helps for the case of IPC (vhost). It looks to me the worst case it
>> may cause page to be thrash between nodes if vhost and userspace are running
>> in two nodes.
>
> So again it's a gain for vhost but has a completely unpredictable effect on
> other functionality of the guest.
>
> That's what bothers me with this approach.


So:

- The rest of the pages could still be balanced to other nodes, no?

- try to balance metadata pages (belongs to co-operate processes) itself 
is still questionable


>
>
>
>
>>>> This is the price of all GUP users not only vhost itself.
>>> Yes. GUP is just not a great interface for vhost to use.
>>
>> Zerocopy codes (enabled by defualt) use them for years.
> But only for TX and temporarily. We pin, read, unpin.


Probably not. For several reasons that the page will be not be released 
soon or held for a very long period of time or even forever.


>
> Your patch is different
>
> - it writes into memory and GUP has known issues with file
>    backed memory


The ordinary user for vhost is anonymous pages I think?


> - it keeps pages pinned forever
>
>
>
>>>> What's more
>>>> important, the goal is not to be left too much behind for other backends
>>>> like DPDK or AF_XDP (all of which are using GUP).
>>> So these guys assume userspace knows what it's doing.
>>> We can't assume that.
>>
>> What kind of assumption do you they have?
>>
>>
>>>>> userspace-controlled
>>>>> amount of memory locked up and not accounted for.
>>>> It's pretty easy to add this since the slow path was still kept. If we
>>>> exceeds the limitation, we can switch back to slow path.
>>>>
>>>>> Don't get me wrong it's a great patch in an ideal world.
>>>>> But then in an ideal world no barriers smap etc are necessary at all.
>>>> Again, this is only for metadata accessing not the data which has been used
>>>> for years for real use cases.
>>>>
>>>> For SMAP, it makes senses for the address that kernel can not forcast. But
>>>> it's not the case for the vhost metadata since we know the address will be
>>>> accessed very frequently. For speculation barrier, it helps nothing for the
>>>> data path of vhost which is a kthread.
>>> I don't see how a kthread makes any difference. We do have a validation
>>> step which makes some difference.
>>
>> The problem is not kthread but the address of userspace address. The
>> addresses of vq metadata tends to be consistent for a while, and vhost knows
>> they will be frequently. SMAP doesn't help too much in this case.
>>
>> Thanks.
> It's true for a real life applications but a malicious one
> can call the setup ioctls any number of times. And SMAP is
> all about malcious applications.


We don't do this in the path of ioctl, there's no context switch between 
userspace and kernel in the worker thread. SMAP is used to prevent 
kernel from accessing userspace pages unexpectedly which is not the case 
for metadata access.

Thanks


>
>>>> Packet or AF_XDP benefit from
>>>> accessing metadata directly, we should do it as well.
>>>>
>>>> Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox