Linux virtualization list
 help / color / mirror / Atom feed
* [PATCH] crypto: virtio: clean up indentation, replace spaces with tab
From: Colin King @ 2018-12-30 13:46 UTC (permalink / raw)
  To: Gonglei, Michael S . Tsirkin, Jason Wang, Herbert Xu,
	David S . Miller, virtualization, linux-crypto
  Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

A statement is indented with spaces and not indented enough, fix this
replacing spaces with a tab.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/crypto/virtio/virtio_crypto_algs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c b/drivers/crypto/virtio/virtio_crypto_algs.c
index 2c573d1aaa64..0704833ece92 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -406,7 +406,7 @@ __virtio_crypto_ablkcipher_do_req(struct virtio_crypto_sym_request *vc_sym_req,
 	} else {
 		req_data->header.session_id =
 			cpu_to_le64(ctx->dec_sess_info.session_id);
-	    req_data->header.opcode =
+		req_data->header.opcode =
 			cpu_to_le32(VIRTIO_CRYPTO_CIPHER_DECRYPT);
 	}
 	req_data->u.sym_req.op_type = cpu_to_le32(VIRTIO_CRYPTO_SYM_OP_CIPHER);
-- 
2.19.1

^ permalink raw reply related

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
From: Michael S. Tsirkin @ 2018-12-30 18:30 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <8ef53a5c-ad4e-fadd-b460-18b3e589ead9@redhat.com>

On Thu, Dec 27, 2018 at 05:39:21PM +0800, Jason Wang wrote:
> 
> On 2018/12/26 下午11:02, Michael S. Tsirkin wrote:
> > On Wed, Dec 26, 2018 at 11:57:32AM +0800, Jason Wang wrote:
> > > On 2018/12/25 下午8:50, Michael S. Tsirkin wrote:
> > > > On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote:
> > > > > On 2018/12/25 上午2:10, Michael S. Tsirkin wrote:
> > > > > > On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
> > > > > > > On 2018/12/14 下午8:36, Michael S. Tsirkin wrote:
> > > > > > > > On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
> > > > > > > > > On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
> > > > > > > > > > On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
> > > > > > > > > > > It was noticed that the copy_user() friends that was used to access
> > > > > > > > > > > virtqueue metdata tends to be very expensive for dataplane
> > > > > > > > > > > implementation like vhost since it involves lots of software check,
> > > > > > > > > > > speculation barrier, hardware feature toggling (e.g SMAP). The
> > > > > > > > > > > extra cost will be more obvious when transferring small packets.
> > > > > > > > > > > 
> > > > > > > > > > > This patch tries to eliminate those overhead by pin vq metadata pages
> > > > > > > > > > > and access them through vmap(). During SET_VRING_ADDR, we will setup
> > > > > > > > > > > those mappings and memory accessors are modified to use pointers to
> > > > > > > > > > > access the metadata directly.
> > > > > > > > > > > 
> > > > > > > > > > > Note, this was only done when device IOTLB is not enabled. We could
> > > > > > > > > > > use similar method to optimize it in the future.
> > > > > > > > > > > 
> > > > > > > > > > > Tests shows about ~24% improvement on TX PPS when using virtio-user +
> > > > > > > > > > > vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
> > > > > > > > > > > 
> > > > > > > > > > > Before: ~5.0Mpps
> > > > > > > > > > > After:  ~6.1Mpps
> > > > > > > > > > > 
> > > > > > > > > > > Signed-off-by: Jason Wang<jasowang@redhat.com>
> > > > > > > > > > > ---
> > > > > > > > > > >       drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > > > >       drivers/vhost/vhost.h |  11 +++
> > > > > > > > > > >       2 files changed, 189 insertions(+)
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > > > > > > index bafe39d2e637..1bd24203afb6 100644
> > > > > > > > > > > --- a/drivers/vhost/vhost.c
> > > > > > > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > > > > > > @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
> > > > > > > > > > >       		vq->indirect = NULL;
> > > > > > > > > > >       		vq->heads = NULL;
> > > > > > > > > > >       		vq->dev = dev;
> > > > > > > > > > > +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
> > > > > > > > > > > +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
> > > > > > > > > > > +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
> > > > > > > > > > >       		mutex_init(&vq->mutex);
> > > > > > > > > > >       		vhost_vq_reset(dev, vq);
> > > > > > > > > > >       		if (vq->handle_kick)
> > > > > > > > > > > @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
> > > > > > > > > > >       	spin_unlock(&dev->iotlb_lock);
> > > > > > > > > > >       }
> > > > > > > > > > > +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
> > > > > > > > > > > +			   size_t size, int write)
> > > > > > > > > > > +{
> > > > > > > > > > > +	struct page **pages;
> > > > > > > > > > > +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
> > > > > > > > > > > +	int npinned;
> > > > > > > > > > > +	void *vaddr;
> > > > > > > > > > > +
> > > > > > > > > > > +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
> > > > > > > > > > > +	if (!pages)
> > > > > > > > > > > +		return -ENOMEM;
> > > > > > > > > > > +
> > > > > > > > > > > +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
> > > > > > > > > > > +	if (npinned != npages)
> > > > > > > > > > > +		goto err;
> > > > > > > > > > > +
> > > > > > > > > > As I said I have doubts about the whole approach, but this
> > > > > > > > > > implementation in particular isn't a good idea
> > > > > > > > > > as it keeps the page around forever.
> > > > > > > The pages wil be released during set features.
> > > > > > > 
> > > > > > > 
> > > > > > > > > > So no THP, no NUMA rebalancing,
> > > > > > > For THP, we will probably miss 2 or 4 pages, but does this really matter
> > > > > > > consider the gain we have?
> > > > > > We as in vhost? networking isn't the only thing guest does.
> > > > > > We don't even know if this guest does a lot of networking.
> > > > > > You don't
> > > > > > know what else is in this huge page. Can be something very important
> > > > > > that guest touches all the time.
> > > > > Well, the probability should be very small consider we usually give several
> > > > > gigabytes to guest. The rest of the pages that doesn't sit in the same
> > > > > hugepage with metadata can still be merged by THP.  Anyway, I can test the
> > > > > differences.
> > > > Thanks!
> > > > 
> > > > > > > For NUMA rebalancing, I'm even not quite sure if
> > > > > > > it can helps for the case of IPC (vhost). It looks to me the worst case it
> > > > > > > may cause page to be thrash between nodes if vhost and userspace are running
> > > > > > > in two nodes.
> > > > > > So again it's a gain for vhost but has a completely unpredictable effect on
> > > > > > other functionality of the guest.
> > > > > > 
> > > > > > That's what bothers me with this approach.
> > > > > So:
> > > > > 
> > > > > - The rest of the pages could still be balanced to other nodes, no?
> > > > > 
> > > > > - try to balance metadata pages (belongs to co-operate processes) itself is
> > > > > still questionable
> > > > I am not sure why. It should be easy enough to force the VCPU and vhost
> > > > to move (e.g. start them pinned to 1 cpu, then pin them to another one).
> > > > Clearly sometimes this would be necessary for load balancing reasons.
> > > 
> > > Yes, but it looks to me the part of motivation of auto NUMA is to avoid
> > > manual pinning.
> > ... of memory. Yes.
> > 
> > 
> > > > With autonuma after a while (could take seconds but it will happen) the
> > > > memory will migrate.
> > > > 
> > > Yes. As you mentioned during the discuss, I wonder we could do it similarly
> > > through mmu notifier like APIC access page in commit c24ae0dcd3e ("kvm: x86:
> > > Unpin and remove kvm_arch->apic_access_page")
> > That would be a possible approach.
> 
> 
> Yes, this looks possible, and the conversion seems not hard. Let me have a
> try with this.
> 
> 
> [...]
> 
> 
> > > > > > > > I don't see how a kthread makes any difference. We do have a validation
> > > > > > > > step which makes some difference.
> > > > > > > The problem is not kthread but the address of userspace address. The
> > > > > > > addresses of vq metadata tends to be consistent for a while, and vhost knows
> > > > > > > they will be frequently. SMAP doesn't help too much in this case.
> > > > > > > 
> > > > > > > Thanks.
> > > > > > It's true for a real life applications but a malicious one
> > > > > > can call the setup ioctls any number of times. And SMAP is
> > > > > > all about malcious applications.
> > > > > We don't do this in the path of ioctl, there's no context switch between
> > > > > userspace and kernel in the worker thread. SMAP is used to prevent kernel
> > > > > from accessing userspace pages unexpectedly which is not the case for
> > > > > metadata access.
> > > > > 
> > > > > Thanks
> > > > OK let's forget smap for now.
> > > 
> > > Some numbers I measured:
> > > 
> > > On an old Sandy bridge machine without SMAP support. Remove speculation
> > > barrier boost the performance from 4.6Mpps to 5.1Mpps
> > > 
> > > On a newer Broadwell machine with SMAP support. Remove speculation barrier
> > > only gives 2%-5% improvement, disable SMAP completely through Kconfig boost
> > > 57% performance from 4.8Mpps to 7.5Mpps. (Vmap gives 6Mpps - 6.1Mpps, it
> > > only bypass SMAP for metadata).
> > > 
> > > So it looks like for recent machine, SMAP becomes pain point when the copy
> > > is short (e.g 64B) for high PPS.
> > > 
> > > Thanks
> > Thanks a lot for looking into this!
> > 
> > So first of all users can just boot with nosmap, right?
> > What's wrong with that?
> 
> 
> Nothing wrong, just realize we had this kernel parameter.
> 
> 
> >   Yes it's not fine-grained but OTOH
> > it's easy to understand.
> > 
> > And I guess this confirms that if we are going to worry
> > about smap enabled, we need to look into packet copies
> > too, not just meta-data.
> 
> 
> For packet copies, we can do batch copy which is pretty simple for the case
> of XDP. I've already had patches for this.
> 
> 
> > 
> > Vaguely could see a module option (off by default)
> > where vhost basically does user_access_begin
> > when it starts running, then uses unsafe accesses
> > in vhost and tun and then user_access_end.
> 
> 
> Using user_access_begin() is more tricky than imaged. E.g it requires:
> 
> - userspace address to be validated before through access_ok() [1]

This part is fine I think - addresses come from the memory
map and when userspace supplies the memory map
we validate everything with access_ok.
Well do we validate with the iotlb too? Don't see it right now
so maybe not but it's easy to add.

> - It doesn't support calling a function that does explicit schedule since
> SMAP/PAN state is not maintained through schedule() [2]
> 
> [1] https://lwn.net/Articles/736348/
> 
> [2] https://lkml.org/lkml/2018/11/23/430
> 
> So calling user_access_begin() all the time when vhost is running seems
> pretty dangerous.

Yes it requires some rework e.g. to try getting memory with
GFP_ATOMIC. We could then do a slow path with GFP_KERNEL
if that fails.

> For a better batched datacopy, I tend to build not only XDP but also skb in
> vhost in the future.
> 
> Thanks

Sure, why not.

> 
> > 
> > 
> > > > > > > > > Packet or AF_XDP benefit from
> > > > > > > > > accessing metadata directly, we should do it as well.
> > > > > > > > > 
> > > > > > > > > Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH RFC 1/2] virtio-net: bql support
From: Michael S. Tsirkin @ 2018-12-30 18:45 UTC (permalink / raw)
  To: Jason Wang
  Cc: netdev, linux-kernel, virtualization, maxime.coquelin, wexu,
	David S. Miller
In-Reply-To: <0fa99d9b-e510-d7eb-db1b-831bd7610ce9@redhat.com>

On Thu, Dec 27, 2018 at 06:00:36PM +0800, Jason Wang wrote:
> 
> On 2018/12/26 下午11:19, Michael S. Tsirkin wrote:
> > On Thu, Dec 06, 2018 at 04:17:36PM +0800, Jason Wang wrote:
> > > On 2018/12/6 上午6:54, Michael S. Tsirkin wrote:
> > > > When use_napi is set, let's enable BQLs.  Note: some of the issues are
> > > > similar to wifi.  It's worth considering whether something similar to
> > > > commit 36148c2bbfbe ("mac80211: Adjust TSQ pacing shift") might be
> > > > benefitial.
> > > 
> > > I've played a similar patch several days before. The tricky part is the mode
> > > switching between napi and no napi. We should make sure when the packet is
> > > sent and trakced by BQL,  it should be consumed by BQL as well. I did it by
> > > tracking it through skb->cb.  And deal with the freeze by reset the BQL
> > > status. Patch attached.
> > > 
> > > But when testing with vhost-net, I don't very a stable performance,
> > So how about increasing TSQ pacing shift then?
> 
> 
> I can test this. But changing default TCP value is much more than a
> virtio-net specific thing.

Well same logic as wifi applies. Unpredictable latencies related
to radio in one case, to host scheduler in the other.

> 
> > 
> > > it was
> > > probably because we batch the used ring updating so tx interrupt may come
> > > randomly. We probably need to implement time bounded coalescing mechanism
> > > which could be configured from userspace.
> > I don't think it's reasonable to expect userspace to be that smart ...
> > Why do we need time bounded? used ring is always updated when ring
> > becomes empty.
> 
> 
> We don't add used when means BQL may not see the consumed packet in time.
> And the delay varies based on the workload since we count packets not bytes
> or time before doing the batched updating.
> 
> Thanks

Sorry I still don't get it.
When nothing is outstanding then we do update the used.
So if BQL stops userspace from sending packets then
we get an interrupt and packets start flowing again.

It might be suboptimal, we might need to tune it but I doubt running
timers is a solution, timer interrupts cause VM exits.

> 
> > 
> > > Btw, maybe it's time just enable napi TX by default. I get ~10% TCP_RR
> > > regression on machine without APICv, (haven't found time to test APICv
> > > machine). But consider it was for correctness, I think it's acceptable? Then
> > > we can do optimization on top?
> > > 
> > > 
> > > Thanks
> > > 
> > > 
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > ---
> > > >    drivers/net/virtio_net.c | 27 +++++++++++++++++++--------
> > > >    1 file changed, 19 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > index cecfd77c9f3c..b657bde6b94b 100644
> > > > --- a/drivers/net/virtio_net.c
> > > > +++ b/drivers/net/virtio_net.c
> > > > @@ -1325,7 +1325,8 @@ static int virtnet_receive(struct receive_queue *rq, int budget,
> > > >    	return stats.packets;
> > > >    }
> > > > -static void free_old_xmit_skbs(struct send_queue *sq)
> > > > +static void free_old_xmit_skbs(struct send_queue *sq, struct netdev_queue *txq,
> > > > +			       bool use_napi)
> > > >    {
> > > >    	struct sk_buff *skb;
> > > >    	unsigned int len;
> > > > @@ -1347,6 +1348,9 @@ static void free_old_xmit_skbs(struct send_queue *sq)
> > > >    	if (!packets)
> > > >    		return;
> > > > +	if (use_napi)
> > > > +		netdev_tx_completed_queue(txq, packets, bytes);
> > > > +
> > > >    	u64_stats_update_begin(&sq->stats.syncp);
> > > >    	sq->stats.bytes += bytes;
> > > >    	sq->stats.packets += packets;
> > > > @@ -1364,7 +1368,7 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
> > > >    		return;
> > > >    	if (__netif_tx_trylock(txq)) {
> > > > -		free_old_xmit_skbs(sq);
> > > > +		free_old_xmit_skbs(sq, txq, true);
> > > >    		__netif_tx_unlock(txq);
> > > >    	}
> > > > @@ -1440,7 +1444,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
> > > >    	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
> > > >    	__netif_tx_lock(txq, raw_smp_processor_id());
> > > > -	free_old_xmit_skbs(sq);
> > > > +	free_old_xmit_skbs(sq, txq, true);
> > > >    	__netif_tx_unlock(txq);
> > > >    	virtqueue_napi_complete(napi, sq->vq, 0);
> > > > @@ -1505,13 +1509,15 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> > > >    	struct send_queue *sq = &vi->sq[qnum];
> > > >    	int err;
> > > >    	struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
> > > > -	bool kick = !skb->xmit_more;
> > > > +	bool more = skb->xmit_more;
> > > >    	bool use_napi = sq->napi.weight;
> > > > +	unsigned int bytes = skb->len;
> > > > +	bool kick;
> > > >    	/* Free up any pending old buffers before queueing new ones. */
> > > > -	free_old_xmit_skbs(sq);
> > > > +	free_old_xmit_skbs(sq, txq, use_napi);
> > > > -	if (use_napi && kick)
> > > > +	if (use_napi && !more)
> > > >    		virtqueue_enable_cb_delayed(sq->vq);
> > > >    	/* timestamp packet in software */
> > > > @@ -1552,7 +1558,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> > > >    		if (!use_napi &&
> > > >    		    unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
> > > >    			/* More just got used, free them then recheck. */
> > > > -			free_old_xmit_skbs(sq);
> > > > +			free_old_xmit_skbs(sq, txq, false);
> > > >    			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
> > > >    				netif_start_subqueue(dev, qnum);
> > > >    				virtqueue_disable_cb(sq->vq);
> > > > @@ -1560,7 +1566,12 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> > > >    		}
> > > >    	}
> > > > -	if (kick || netif_xmit_stopped(txq)) {
> > > > +	if (use_napi)
> > > > +		kick = __netdev_tx_sent_queue(txq, bytes, more);
> > > > +	else
> > > > +		kick = !more || netif_xmit_stopped(txq);
> > > > +
> > > > +	if (kick) {
> > > >    		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq)) {
> > > >    			u64_stats_update_begin(&sq->stats.syncp);
> > > >    			sq->stats.kicks++;
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH RFC 1/2] virtio-net: bql support
From: Michael S. Tsirkin @ 2018-12-30 18:48 UTC (permalink / raw)
  To: Jason Wang
  Cc: netdev, linux-kernel, virtualization, maxime.coquelin, wexu,
	David S. Miller
In-Reply-To: <620cfd46-aa3e-7eb6-0757-f4afbafda44b@redhat.com>

On Thu, Dec 27, 2018 at 06:04:53PM +0800, Jason Wang wrote:
> 
> On 2018/12/26 下午11:22, Michael S. Tsirkin wrote:
> > On Thu, Dec 06, 2018 at 04:17:36PM +0800, Jason Wang wrote:
> > > On 2018/12/6 上午6:54, Michael S. Tsirkin wrote:
> > > > When use_napi is set, let's enable BQLs.  Note: some of the issues are
> > > > similar to wifi.  It's worth considering whether something similar to
> > > > commit 36148c2bbfbe ("mac80211: Adjust TSQ pacing shift") might be
> > > > benefitial.
> > > 
> > > I've played a similar patch several days before. The tricky part is the mode
> > > switching between napi and no napi. We should make sure when the packet is
> > > sent and trakced by BQL,  it should be consumed by BQL as well.
> > 
> > I just went over the patch again and I don't understand this comment.
> > This patch only enabled BQL with tx napi.
> > 
> > Thus there's no mode switching.
> > 
> > What did I miss?
> 
> 
> Consider the case:
> 
> 
> TX NAPI is disabled:
> 
> send N packets
> 
> turn TX NAPI on:
> 
> get tx interrupt
> 
> BQL try to consume those packets when lead WARN for dql.
> 
> 
> Thanks

Can one really switch tx napi on and off? How?
While root can change the napi_tx module parameter, I don't think
that has any effect outside device probe time. What did I miss?



> 
> 
> > 
> > 
> > > I did it by
> > > tracking it through skb->cb.  And deal with the freeze by reset the BQL
> > > status. Patch attached.
> > > 
> > > But when testing with vhost-net, I don't very a stable performance, it was
> > > probably because we batch the used ring updating so tx interrupt may come
> > > randomly. We probably need to implement time bounded coalescing mechanism
> > > which could be configured from userspace.
> > > 
> > > Btw, maybe it's time just enable napi TX by default. I get ~10% TCP_RR
> > > regression on machine without APICv, (haven't found time to test APICv
> > > machine). But consider it was for correctness, I think it's acceptable? Then
> > > we can do optimization on top?
> > > 
> > > 
> > > Thanks
> > > 
> > > 
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > ---
> > > >    drivers/net/virtio_net.c | 27 +++++++++++++++++++--------
> > > >    1 file changed, 19 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > index cecfd77c9f3c..b657bde6b94b 100644
> > > > --- a/drivers/net/virtio_net.c
> > > > +++ b/drivers/net/virtio_net.c
> > > > @@ -1325,7 +1325,8 @@ static int virtnet_receive(struct receive_queue *rq, int budget,
> > > >    	return stats.packets;
> > > >    }
> > > > -static void free_old_xmit_skbs(struct send_queue *sq)
> > > > +static void free_old_xmit_skbs(struct send_queue *sq, struct netdev_queue *txq,
> > > > +			       bool use_napi)
> > > >    {
> > > >    	struct sk_buff *skb;
> > > >    	unsigned int len;
> > > > @@ -1347,6 +1348,9 @@ static void free_old_xmit_skbs(struct send_queue *sq)
> > > >    	if (!packets)
> > > >    		return;
> > > > +	if (use_napi)
> > > > +		netdev_tx_completed_queue(txq, packets, bytes);
> > > > +
> > > >    	u64_stats_update_begin(&sq->stats.syncp);
> > > >    	sq->stats.bytes += bytes;
> > > >    	sq->stats.packets += packets;
> > > > @@ -1364,7 +1368,7 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
> > > >    		return;
> > > >    	if (__netif_tx_trylock(txq)) {
> > > > -		free_old_xmit_skbs(sq);
> > > > +		free_old_xmit_skbs(sq, txq, true);
> > > >    		__netif_tx_unlock(txq);
> > > >    	}
> > > > @@ -1440,7 +1444,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
> > > >    	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
> > > >    	__netif_tx_lock(txq, raw_smp_processor_id());
> > > > -	free_old_xmit_skbs(sq);
> > > > +	free_old_xmit_skbs(sq, txq, true);
> > > >    	__netif_tx_unlock(txq);
> > > >    	virtqueue_napi_complete(napi, sq->vq, 0);
> > > > @@ -1505,13 +1509,15 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> > > >    	struct send_queue *sq = &vi->sq[qnum];
> > > >    	int err;
> > > >    	struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
> > > > -	bool kick = !skb->xmit_more;
> > > > +	bool more = skb->xmit_more;
> > > >    	bool use_napi = sq->napi.weight;
> > > > +	unsigned int bytes = skb->len;
> > > > +	bool kick;
> > > >    	/* Free up any pending old buffers before queueing new ones. */
> > > > -	free_old_xmit_skbs(sq);
> > > > +	free_old_xmit_skbs(sq, txq, use_napi);
> > > > -	if (use_napi && kick)
> > > > +	if (use_napi && !more)
> > > >    		virtqueue_enable_cb_delayed(sq->vq);
> > > >    	/* timestamp packet in software */
> > > > @@ -1552,7 +1558,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> > > >    		if (!use_napi &&
> > > >    		    unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
> > > >    			/* More just got used, free them then recheck. */
> > > > -			free_old_xmit_skbs(sq);
> > > > +			free_old_xmit_skbs(sq, txq, false);
> > > >    			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
> > > >    				netif_start_subqueue(dev, qnum);
> > > >    				virtqueue_disable_cb(sq->vq);
> > > > @@ -1560,7 +1566,12 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> > > >    		}
> > > >    	}
> > > > -	if (kick || netif_xmit_stopped(txq)) {
> > > > +	if (use_napi)
> > > > +		kick = __netdev_tx_sent_queue(txq, bytes, more);
> > > > +	else
> > > > +		kick = !more || netif_xmit_stopped(txq);
> > > > +
> > > > +	if (kick) {
> > > >    		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq)) {
> > > >    			u64_stats_update_begin(&sq->stats.syncp);
> > > >    			sq->stats.kicks++;
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* RE: [PATCH v1 0/2] Virtio: fix some vq allocation issues
From: Wang, Wei W @ 2018-12-31  6:03 UTC (permalink / raw)
  To: 'Halil Pasic'
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org,
	mst@redhat.com, cohuck@redhat.com, linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, pbonzini@redhat.com,
	dgilbert@redhat.com
In-Reply-To: <20181230070600.512bbb8b@oc2783563651>

On Sunday, December 30, 2018 2:06 PM, Halil Pasic wrote:
> 
> I guess you are the first one trying to read virtio config from within interrupt
> context. AFAICT this never worked.

I'm not sure about "never worked". It seems to work well with virtio-pci.
But looking forward to hearing a solid reason why reading config inside
the handler is forbidden (if that's true).

> About what happens. The apidoc of ccw_device_start() says it needs to be
> called with the ccw device lock held, so ccw_io_helper() tries to take it (since
> forever I guess). OTOH do_cio_interrupt() takes the subchannel lock and
> io_subchannel_initialize_dev() makes the ccw device lock be the subchannel
> lock. That means when one tries to get virtio config form within a cio
> interrupt context we deadlock, because we try to take a lock we already have.
> 
> That said, I don't think this limitation is by design (i.e. intended).
> Maybe Connie can help us with that question. AFAIK we have nothing
> documented regarding this (neither that can nor can't).
> 
> Obviously, there are multiple ways around this problem, and at the moment
> I can't tell which would be my preferred one.

Yes, it's also not difficult to tweak the virtio-balloon code to avoid that issue.
But if that's just an issue with ccw itself, I think it's better to tweak ccw and
remain virtio-balloon unchanged.

Best,
Wei

^ permalink raw reply

* Re: [PATCH RFC 1/2] virtio-net: bql support
From: Jason Wang @ 2019-01-02  3:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, linux-kernel, virtualization, maxime.coquelin, wexu,
	David S. Miller
In-Reply-To: <20181230134106-mutt-send-email-mst@kernel.org>


On 2018/12/31 上午2:45, Michael S. Tsirkin wrote:
> On Thu, Dec 27, 2018 at 06:00:36PM +0800, Jason Wang wrote:
>> On 2018/12/26 下午11:19, Michael S. Tsirkin wrote:
>>> On Thu, Dec 06, 2018 at 04:17:36PM +0800, Jason Wang wrote:
>>>> On 2018/12/6 上午6:54, Michael S. Tsirkin wrote:
>>>>> When use_napi is set, let's enable BQLs.  Note: some of the issues are
>>>>> similar to wifi.  It's worth considering whether something similar to
>>>>> commit 36148c2bbfbe ("mac80211: Adjust TSQ pacing shift") might be
>>>>> benefitial.
>>>> I've played a similar patch several days before. The tricky part is the mode
>>>> switching between napi and no napi. We should make sure when the packet is
>>>> sent and trakced by BQL,  it should be consumed by BQL as well. I did it by
>>>> tracking it through skb->cb.  And deal with the freeze by reset the BQL
>>>> status. Patch attached.
>>>>
>>>> But when testing with vhost-net, I don't very a stable performance,
>>> So how about increasing TSQ pacing shift then?
>>
>> I can test this. But changing default TCP value is much more than a
>> virtio-net specific thing.
> Well same logic as wifi applies. Unpredictable latencies related
> to radio in one case, to host scheduler in the other.
>
>>>> it was
>>>> probably because we batch the used ring updating so tx interrupt may come
>>>> randomly. We probably need to implement time bounded coalescing mechanism
>>>> which could be configured from userspace.
>>> I don't think it's reasonable to expect userspace to be that smart ...
>>> Why do we need time bounded? used ring is always updated when ring
>>> becomes empty.
>>
>> We don't add used when means BQL may not see the consumed packet in time.
>> And the delay varies based on the workload since we count packets not bytes
>> or time before doing the batched updating.
>>
>> Thanks
> Sorry I still don't get it.
> When nothing is outstanding then we do update the used.
> So if BQL stops userspace from sending packets then
> we get an interrupt and packets start flowing again.


Yes, but how about the cases of multiple flows. That's where I see 
unstable results.


>
> It might be suboptimal, we might need to tune it but I doubt running
> timers is a solution, timer interrupts cause VM exits.


Probably not a timer but a time counter (or event byte counter) in vhost 
to add used and signal guest if it exceeds a value instead of waiting 
the number of packets.


Thanks


>
>>>> Btw, maybe it's time just enable napi TX by default. I get ~10% TCP_RR
>>>> regression on machine without APICv, (haven't found time to test APICv
>>>> machine). But consider it was for correctness, I think it's acceptable? Then
>>>> we can do optimization on top?
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>>>> ---
>>>>>     drivers/net/virtio_net.c | 27 +++++++++++++++++++--------
>>>>>     1 file changed, 19 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>> index cecfd77c9f3c..b657bde6b94b 100644
>>>>> --- a/drivers/net/virtio_net.c
>>>>> +++ b/drivers/net/virtio_net.c
>>>>> @@ -1325,7 +1325,8 @@ static int virtnet_receive(struct receive_queue *rq, int budget,
>>>>>     	return stats.packets;
>>>>>     }
>>>>> -static void free_old_xmit_skbs(struct send_queue *sq)
>>>>> +static void free_old_xmit_skbs(struct send_queue *sq, struct netdev_queue *txq,
>>>>> +			       bool use_napi)
>>>>>     {
>>>>>     	struct sk_buff *skb;
>>>>>     	unsigned int len;
>>>>> @@ -1347,6 +1348,9 @@ static void free_old_xmit_skbs(struct send_queue *sq)
>>>>>     	if (!packets)
>>>>>     		return;
>>>>> +	if (use_napi)
>>>>> +		netdev_tx_completed_queue(txq, packets, bytes);
>>>>> +
>>>>>     	u64_stats_update_begin(&sq->stats.syncp);
>>>>>     	sq->stats.bytes += bytes;
>>>>>     	sq->stats.packets += packets;
>>>>> @@ -1364,7 +1368,7 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
>>>>>     		return;
>>>>>     	if (__netif_tx_trylock(txq)) {
>>>>> -		free_old_xmit_skbs(sq);
>>>>> +		free_old_xmit_skbs(sq, txq, true);
>>>>>     		__netif_tx_unlock(txq);
>>>>>     	}
>>>>> @@ -1440,7 +1444,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
>>>>>     	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
>>>>>     	__netif_tx_lock(txq, raw_smp_processor_id());
>>>>> -	free_old_xmit_skbs(sq);
>>>>> +	free_old_xmit_skbs(sq, txq, true);
>>>>>     	__netif_tx_unlock(txq);
>>>>>     	virtqueue_napi_complete(napi, sq->vq, 0);
>>>>> @@ -1505,13 +1509,15 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>     	struct send_queue *sq = &vi->sq[qnum];
>>>>>     	int err;
>>>>>     	struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
>>>>> -	bool kick = !skb->xmit_more;
>>>>> +	bool more = skb->xmit_more;
>>>>>     	bool use_napi = sq->napi.weight;
>>>>> +	unsigned int bytes = skb->len;
>>>>> +	bool kick;
>>>>>     	/* Free up any pending old buffers before queueing new ones. */
>>>>> -	free_old_xmit_skbs(sq);
>>>>> +	free_old_xmit_skbs(sq, txq, use_napi);
>>>>> -	if (use_napi && kick)
>>>>> +	if (use_napi && !more)
>>>>>     		virtqueue_enable_cb_delayed(sq->vq);
>>>>>     	/* timestamp packet in software */
>>>>> @@ -1552,7 +1558,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>     		if (!use_napi &&
>>>>>     		    unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
>>>>>     			/* More just got used, free them then recheck. */
>>>>> -			free_old_xmit_skbs(sq);
>>>>> +			free_old_xmit_skbs(sq, txq, false);
>>>>>     			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
>>>>>     				netif_start_subqueue(dev, qnum);
>>>>>     				virtqueue_disable_cb(sq->vq);
>>>>> @@ -1560,7 +1566,12 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>     		}
>>>>>     	}
>>>>> -	if (kick || netif_xmit_stopped(txq)) {
>>>>> +	if (use_napi)
>>>>> +		kick = __netdev_tx_sent_queue(txq, bytes, more);
>>>>> +	else
>>>>> +		kick = !more || netif_xmit_stopped(txq);
>>>>> +
>>>>> +	if (kick) {
>>>>>     		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq)) {
>>>>>     			u64_stats_update_begin(&sq->stats.syncp);
>>>>>     			sq->stats.kicks++;
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH RFC 1/2] virtio-net: bql support
From: Jason Wang @ 2019-01-02  3:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, linux-kernel, virtualization, maxime.coquelin, wexu,
	David S. Miller
In-Reply-To: <20181230134539-mutt-send-email-mst@kernel.org>


On 2018/12/31 上午2:48, Michael S. Tsirkin wrote:
> On Thu, Dec 27, 2018 at 06:04:53PM +0800, Jason Wang wrote:
>> On 2018/12/26 下午11:22, Michael S. Tsirkin wrote:
>>> On Thu, Dec 06, 2018 at 04:17:36PM +0800, Jason Wang wrote:
>>>> On 2018/12/6 上午6:54, Michael S. Tsirkin wrote:
>>>>> When use_napi is set, let's enable BQLs.  Note: some of the issues are
>>>>> similar to wifi.  It's worth considering whether something similar to
>>>>> commit 36148c2bbfbe ("mac80211: Adjust TSQ pacing shift") might be
>>>>> benefitial.
>>>> I've played a similar patch several days before. The tricky part is the mode
>>>> switching between napi and no napi. We should make sure when the packet is
>>>> sent and trakced by BQL,  it should be consumed by BQL as well.
>>> I just went over the patch again and I don't understand this comment.
>>> This patch only enabled BQL with tx napi.
>>>
>>> Thus there's no mode switching.
>>>
>>> What did I miss?
>> Consider the case:
>>
>>
>> TX NAPI is disabled:
>>
>> send N packets
>>
>> turn TX NAPI on:
>>
>> get tx interrupt
>>
>> BQL try to consume those packets when lead WARN for dql.
>>
>>
>> Thanks
> Can one really switch tx napi on and off? How?
> While root can change the napi_tx module parameter, I don't think
> that has any effect outside device probe time. What did I miss?
>
>
>

We support switch the mode through ethtool recently. See

commit 0c465be183c7c57a26446df6ea96d8676b865f92
Author: Jason Wang <jasowang@redhat.com>
Date:   Tue Oct 9 10:06:26 2018 +0800

     virtio_net: ethtool tx napi configuration

     Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
     Interrupt moderation is currently not supported, so these accept and
     display the default settings of 0 usec and 1 frame.

     Toggle tx napi through setting tx-frames. So as to not interfere
     with possible future interrupt moderation, value 1 means tx napi while
     value 0 means not.

     Only allow the switching when device is down for simplicity.

     Link: https://patchwork.ozlabs.org/patch/948149/
     Suggested-by: Jason Wang <jasowang@redhat.com>
     Signed-off-by: Willem de Bruijn <willemb@google.com>
     Signed-off-by: Jason Wang <jasowang@redhat.com>
     Signed-off-by: David S. Miller <davem@davemloft.net>

Thanks

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [virtio-dev] RE: [PATCH v1 0/2] Virtio: fix some vq allocation issues
From: Cornelia Huck @ 2019-01-02  9:53 UTC (permalink / raw)
  To: Halil Pasic
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org,
	mst@redhat.com, linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, pbonzini@redhat.com,
	dgilbert@redhat.com
In-Reply-To: <20190101004019.7f20aafa@oc2783563651>

On Tue, 1 Jan 2019 00:40:19 +0100
Halil Pasic <pasic@linux.ibm.com> wrote:

> On Mon, 31 Dec 2018 06:03:51 +0000
> "Wang, Wei W" <wei.w.wang@intel.com> wrote:
> 
> > On Sunday, December 30, 2018 2:06 PM, Halil Pasic wrote:  
> > > 
> > > I guess you are the first one trying to read virtio config from within interrupt
> > > context. AFAICT this never worked.  
> > 
> > I'm not sure about "never worked". It seems to work well with virtio-pci.
> > But looking forward to hearing a solid reason why reading config inside
> > the handler is forbidden (if that's true).  
> 
> By "never worked" I meant "never worked with virtio-ccw". Sorry
> about the misunderstanding. Seems I've also failed to convey that I don't
> know if reading config inside the handler is forbidden or not. So please
> don't expect me providing the solid reasons you are looking forward to.

It won't work with the current code, and this is all a bit ugly :( More
verbose explanation below.

> 
> >   
> > > About what happens. The apidoc of ccw_device_start() says it needs to be
> > > called with the ccw device lock held, so ccw_io_helper() tries to take it (since
> > > forever I guess). OTOH do_cio_interrupt() takes the subchannel lock and
> > > io_subchannel_initialize_dev() makes the ccw device lock be the subchannel
> > > lock. That means when one tries to get virtio config form within a cio
> > > interrupt context we deadlock, because we try to take a lock we already have.
> > > 
> > > That said, I don't think this limitation is by design (i.e. intended).
> > > Maybe Connie can help us with that question. AFAIK we have nothing
> > > documented regarding this (neither that can nor can't).

The main problem is that channel I/O is a fundamentally asynchronous
mechanism. As channel devices don't have the concept of config spaces
(or some other things that virtio needs), I decided to map
reading/writing the config space to channel commands. Starting I/O on a
subchannel always needs the lock (to avoid races on the subchannel),
and the asynchronous interrupt for that I/O needs the lock as well (for
the same reason; things like the scsw contain state that you want to
access without races). A config change also means that the subchannel
becomes state pending (and an interrupt is made pending), so the
subchannel lock is taken for that path as well. (Virtqueue
notifications are handled differently on modern QEMU, but that does not
come into play here.)

> > > 
> > > Obviously, there are multiple ways around this problem, and at the moment
> > > I can't tell which would be my preferred one.  
> > 
> > Yes, it's also not difficult to tweak the virtio-balloon code to avoid that issue.
> > But if that's just an issue with ccw itself, I think it's better to tweak ccw and
> > remain virtio-balloon unchanged.
> >   
> 
> As I said, at the moment I don't have a preference regarding the fix,
> partly because I'm not sure if "reading config inside the handler" is OK
> or not. Maybe Connie or Michael can help us here. I'm however sure that
> commit 86a5597 "virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT"
> breaks virtio-balloon with the ccw transport (i.e. effectively breaks 
> virtio-balloon on s390): it used to work before and does not work
> after.

Yes, that's unfortunate.

> 
> AFAICT tweaking the balloon code may be simpler than tweaking the
> virtio-ccw (transport code). ccw_io_helper() relies on getting
> an interrupt when the issued IO is done. If virtio-ccw is buggy, it
> needs to be fixed, but I'm not sure it is.

I would not call virtio-ccw buggy, but it has some constraints that
virtio-pci apparently doesn't have (and which did not show up so far;
e.g. virtio-blk schedules a work item on config change, so there's no
deadlock there.)

One way to get out of that constraint (don't interact with the config
space directly in the config changed handler) would be to schedule a
work item in virtio-ccw that calls virtio_config_changed() for the
device. My understanding is that delaying the notification to a work
queue would be fine.

From what I can see, that should make us safe on any modern QEMU (that
uses adapter interrupts). There still might be problems if we are using
classic subchannel interrupts for virtqueue notifications and the
handler interacts with the config space, but I'm not sure whether it is
worth spending time on that.

^ permalink raw reply

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
From: Jason Wang @ 2019-01-02 11:38 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181230132614-mutt-send-email-mst@kernel.org>


On 2018/12/31 上午2:30, Michael S. Tsirkin wrote:
> On Thu, Dec 27, 2018 at 05:39:21PM +0800, Jason Wang wrote:
>> On 2018/12/26 下午11:02, Michael S. Tsirkin wrote:
>>> On Wed, Dec 26, 2018 at 11:57:32AM +0800, Jason Wang wrote:
>>>> On 2018/12/25 下午8:50, Michael S. Tsirkin wrote:
>>>>> On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote:
>>>>>> On 2018/12/25 上午2:10, Michael S. Tsirkin wrote:
>>>>>>> On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
>>>>>>>> On 2018/12/14 下午8:36, Michael S. Tsirkin wrote:
>>>>>>>>> On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
>>>>>>>>>> On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
>>>>>>>>>>>> It was noticed that the copy_user() friends that was used to access
>>>>>>>>>>>> virtqueue metdata tends to be very expensive for dataplane
>>>>>>>>>>>> implementation like vhost since it involves lots of software check,
>>>>>>>>>>>> speculation barrier, hardware feature toggling (e.g SMAP). The
>>>>>>>>>>>> extra cost will be more obvious when transferring small packets.
>>>>>>>>>>>>
>>>>>>>>>>>> This patch tries to eliminate those overhead by pin vq metadata pages
>>>>>>>>>>>> and access them through vmap(). During SET_VRING_ADDR, we will setup
>>>>>>>>>>>> those mappings and memory accessors are modified to use pointers to
>>>>>>>>>>>> access the metadata directly.
>>>>>>>>>>>>
>>>>>>>>>>>> Note, this was only done when device IOTLB is not enabled. We could
>>>>>>>>>>>> use similar method to optimize it in the future.
>>>>>>>>>>>>
>>>>>>>>>>>> Tests shows about ~24% improvement on TX PPS when using virtio-user +
>>>>>>>>>>>> vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
>>>>>>>>>>>>
>>>>>>>>>>>> Before: ~5.0Mpps
>>>>>>>>>>>> After:  ~6.1Mpps
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>        drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>        drivers/vhost/vhost.h |  11 +++
>>>>>>>>>>>>        2 files changed, 189 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>>>>>>>>>> index bafe39d2e637..1bd24203afb6 100644
>>>>>>>>>>>> --- a/drivers/vhost/vhost.c
>>>>>>>>>>>> +++ b/drivers/vhost/vhost.c
>>>>>>>>>>>> @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>>>>>>>>>>>>        		vq->indirect = NULL;
>>>>>>>>>>>>        		vq->heads = NULL;
>>>>>>>>>>>>        		vq->dev = dev;
>>>>>>>>>>>> +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
>>>>>>>>>>>> +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
>>>>>>>>>>>> +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>>>>>>>>>>>>        		mutex_init(&vq->mutex);
>>>>>>>>>>>>        		vhost_vq_reset(dev, vq);
>>>>>>>>>>>>        		if (vq->handle_kick)
>>>>>>>>>>>> @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>>>>>>>>>>>>        	spin_unlock(&dev->iotlb_lock);
>>>>>>>>>>>>        }
>>>>>>>>>>>> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
>>>>>>>>>>>> +			   size_t size, int write)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +	struct page **pages;
>>>>>>>>>>>> +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
>>>>>>>>>>>> +	int npinned;
>>>>>>>>>>>> +	void *vaddr;
>>>>>>>>>>>> +
>>>>>>>>>>>> +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
>>>>>>>>>>>> +	if (!pages)
>>>>>>>>>>>> +		return -ENOMEM;
>>>>>>>>>>>> +
>>>>>>>>>>>> +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
>>>>>>>>>>>> +	if (npinned != npages)
>>>>>>>>>>>> +		goto err;
>>>>>>>>>>>> +
>>>>>>>>>>> As I said I have doubts about the whole approach, but this
>>>>>>>>>>> implementation in particular isn't a good idea
>>>>>>>>>>> as it keeps the page around forever.
>>>>>>>> The pages wil be released during set features.
>>>>>>>>
>>>>>>>>
>>>>>>>>>>> So no THP, no NUMA rebalancing,
>>>>>>>> For THP, we will probably miss 2 or 4 pages, but does this really matter
>>>>>>>> consider the gain we have?
>>>>>>> We as in vhost? networking isn't the only thing guest does.
>>>>>>> We don't even know if this guest does a lot of networking.
>>>>>>> You don't
>>>>>>> know what else is in this huge page. Can be something very important
>>>>>>> that guest touches all the time.
>>>>>> Well, the probability should be very small consider we usually give several
>>>>>> gigabytes to guest. The rest of the pages that doesn't sit in the same
>>>>>> hugepage with metadata can still be merged by THP.  Anyway, I can test the
>>>>>> differences.
>>>>> Thanks!
>>>>>
>>>>>>>> For NUMA rebalancing, I'm even not quite sure if
>>>>>>>> it can helps for the case of IPC (vhost). It looks to me the worst case it
>>>>>>>> may cause page to be thrash between nodes if vhost and userspace are running
>>>>>>>> in two nodes.
>>>>>>> So again it's a gain for vhost but has a completely unpredictable effect on
>>>>>>> other functionality of the guest.
>>>>>>>
>>>>>>> That's what bothers me with this approach.
>>>>>> So:
>>>>>>
>>>>>> - The rest of the pages could still be balanced to other nodes, no?
>>>>>>
>>>>>> - try to balance metadata pages (belongs to co-operate processes) itself is
>>>>>> still questionable
>>>>> I am not sure why. It should be easy enough to force the VCPU and vhost
>>>>> to move (e.g. start them pinned to 1 cpu, then pin them to another one).
>>>>> Clearly sometimes this would be necessary for load balancing reasons.
>>>> Yes, but it looks to me the part of motivation of auto NUMA is to avoid
>>>> manual pinning.
>>> ... of memory. Yes.
>>>
>>>
>>>>> With autonuma after a while (could take seconds but it will happen) the
>>>>> memory will migrate.
>>>>>
>>>> Yes. As you mentioned during the discuss, I wonder we could do it similarly
>>>> through mmu notifier like APIC access page in commit c24ae0dcd3e ("kvm: x86:
>>>> Unpin and remove kvm_arch->apic_access_page")
>>> That would be a possible approach.
>>
>> Yes, this looks possible, and the conversion seems not hard. Let me have a
>> try with this.
>>
>>
>> [...]
>>
>>
>>>>>>>>> I don't see how a kthread makes any difference. We do have a validation
>>>>>>>>> step which makes some difference.
>>>>>>>> The problem is not kthread but the address of userspace address. The
>>>>>>>> addresses of vq metadata tends to be consistent for a while, and vhost knows
>>>>>>>> they will be frequently. SMAP doesn't help too much in this case.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>> It's true for a real life applications but a malicious one
>>>>>>> can call the setup ioctls any number of times. And SMAP is
>>>>>>> all about malcious applications.
>>>>>> We don't do this in the path of ioctl, there's no context switch between
>>>>>> userspace and kernel in the worker thread. SMAP is used to prevent kernel
>>>>>> from accessing userspace pages unexpectedly which is not the case for
>>>>>> metadata access.
>>>>>>
>>>>>> Thanks
>>>>> OK let's forget smap for now.
>>>> Some numbers I measured:
>>>>
>>>> On an old Sandy bridge machine without SMAP support. Remove speculation
>>>> barrier boost the performance from 4.6Mpps to 5.1Mpps
>>>>
>>>> On a newer Broadwell machine with SMAP support. Remove speculation barrier
>>>> only gives 2%-5% improvement, disable SMAP completely through Kconfig boost
>>>> 57% performance from 4.8Mpps to 7.5Mpps. (Vmap gives 6Mpps - 6.1Mpps, it
>>>> only bypass SMAP for metadata).
>>>>
>>>> So it looks like for recent machine, SMAP becomes pain point when the copy
>>>> is short (e.g 64B) for high PPS.
>>>>
>>>> Thanks
>>> Thanks a lot for looking into this!
>>>
>>> So first of all users can just boot with nosmap, right?
>>> What's wrong with that?
>>
>> Nothing wrong, just realize we had this kernel parameter.
>>
>>
>>>    Yes it's not fine-grained but OTOH
>>> it's easy to understand.
>>>
>>> And I guess this confirms that if we are going to worry
>>> about smap enabled, we need to look into packet copies
>>> too, not just meta-data.
>>
>> For packet copies, we can do batch copy which is pretty simple for the case
>> of XDP. I've already had patches for this.
>>
>>
>>> Vaguely could see a module option (off by default)
>>> where vhost basically does user_access_begin
>>> when it starts running, then uses unsafe accesses
>>> in vhost and tun and then user_access_end.
>>
>> Using user_access_begin() is more tricky than imaged. E.g it requires:
>>
>> - userspace address to be validated before through access_ok() [1]
> This part is fine I think - addresses come from the memory
> map and when userspace supplies the memory map
> we validate everything with access_ok.
> Well do we validate with the iotlb too? Don't see it right now
> so maybe not but it's easy to add.


Yes, it's not hard.


>
>> - It doesn't support calling a function that does explicit schedule since
>> SMAP/PAN state is not maintained through schedule() [2]
>>
>> [1] https://lwn.net/Articles/736348/
>>
>> [2] https://lkml.org/lkml/2018/11/23/430
>>
>> So calling user_access_begin() all the time when vhost is running seems
>> pretty dangerous.
> Yes it requires some rework e.g. to try getting memory with
> GFP_ATOMIC. We could then do a slow path with GFP_KERNEL
> if that fails.


I'm not sure this is the only part that needs care. Consider all the 
under layer network or block codes assumes a process context, it's not 
easy to figure out all I'm afraid. And even if we could, it's hard to 
prevent it from being added in the future.

Thanks


>
>> For a better batched datacopy, I tend to build not only XDP but also skb in
>> vhost in the future.
>>
>> Thanks
> Sure, why not.
>
>>>
>>>>>>>>>> Packet or AF_XDP benefit from
>>>>>>>>>> accessing metadata directly, we should do it as well.
>>>>>>>>>>
>>>>>>>>>> Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [virtio-dev] RE: [PATCH v1 0/2] Virtio: fix some vq allocation issues
From: Cornelia Huck @ 2019-01-02 13:23 UTC (permalink / raw)
  To: Halil Pasic
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org,
	mst@redhat.com, linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, pbonzini@redhat.com,
	dgilbert@redhat.com
In-Reply-To: <20190102105314.0b4e2485.cohuck@redhat.com>

On Wed, 2 Jan 2019 10:53:14 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> On Tue, 1 Jan 2019 00:40:19 +0100
> Halil Pasic <pasic@linux.ibm.com> wrote:

> > As I said, at the moment I don't have a preference regarding the fix,
> > partly because I'm not sure if "reading config inside the handler" is OK
> > or not. Maybe Connie or Michael can help us here. I'm however sure that
> > commit 86a5597 "virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT"
> > breaks virtio-balloon with the ccw transport (i.e. effectively breaks 
> > virtio-balloon on s390): it used to work before and does not work
> > after.  
> 
> Yes, that's unfortunate.
> 
> > 
> > AFAICT tweaking the balloon code may be simpler than tweaking the
> > virtio-ccw (transport code). ccw_io_helper() relies on getting
> > an interrupt when the issued IO is done. If virtio-ccw is buggy, it
> > needs to be fixed, but I'm not sure it is.  
> 
> I would not call virtio-ccw buggy, but it has some constraints that
> virtio-pci apparently doesn't have (and which did not show up so far;
> e.g. virtio-blk schedules a work item on config change, so there's no
> deadlock there.)
> 
> One way to get out of that constraint (don't interact with the config
> space directly in the config changed handler) would be to schedule a
> work item in virtio-ccw that calls virtio_config_changed() for the
> device. My understanding is that delaying the notification to a work
> queue would be fine.

Unfortunately, calling virtio_config_changed() from a work item is not
enough: That function takes the config_lock, and the virtio-ccw code to
get the config both needs to allocate some memory and call schedule :/

The best option really seems to be
- have virtio-balloon move the handling of the config change onto a
  workqueue or something like that, and
- document that you cannot read/write the virtio config space from an
  atomic context

Unless someone has a better idea?

^ permalink raw reply

* Re: [PATCH RFC 1/2] virtio-net: bql support
From: Michael S. Tsirkin @ 2019-01-02 13:54 UTC (permalink / raw)
  To: Jason Wang
  Cc: netdev, linux-kernel, virtualization, maxime.coquelin, wexu,
	David S. Miller
In-Reply-To: <01e0fe8f-bd75-c39b-9d77-c5a9baf87348@redhat.com>

On Wed, Jan 02, 2019 at 11:30:11AM +0800, Jason Wang wrote:
> 
> On 2018/12/31 上午2:48, Michael S. Tsirkin wrote:
> > On Thu, Dec 27, 2018 at 06:04:53PM +0800, Jason Wang wrote:
> > > On 2018/12/26 下午11:22, Michael S. Tsirkin wrote:
> > > > On Thu, Dec 06, 2018 at 04:17:36PM +0800, Jason Wang wrote:
> > > > > On 2018/12/6 上午6:54, Michael S. Tsirkin wrote:
> > > > > > When use_napi is set, let's enable BQLs.  Note: some of the issues are
> > > > > > similar to wifi.  It's worth considering whether something similar to
> > > > > > commit 36148c2bbfbe ("mac80211: Adjust TSQ pacing shift") might be
> > > > > > benefitial.
> > > > > I've played a similar patch several days before. The tricky part is the mode
> > > > > switching between napi and no napi. We should make sure when the packet is
> > > > > sent and trakced by BQL,  it should be consumed by BQL as well.
> > > > I just went over the patch again and I don't understand this comment.
> > > > This patch only enabled BQL with tx napi.
> > > > 
> > > > Thus there's no mode switching.
> > > > 
> > > > What did I miss?
> > > Consider the case:
> > > 
> > > 
> > > TX NAPI is disabled:
> > > 
> > > send N packets
> > > 
> > > turn TX NAPI on:
> > > 
> > > get tx interrupt
> > > 
> > > BQL try to consume those packets when lead WARN for dql.
> > > 
> > > 
> > > Thanks
> > Can one really switch tx napi on and off? How?
> > While root can change the napi_tx module parameter, I don't think
> > that has any effect outside device probe time. What did I miss?
> > 
> > 
> > 
> 
> We support switch the mode through ethtool recently. See
> 
> commit 0c465be183c7c57a26446df6ea96d8676b865f92
> Author: Jason Wang <jasowang@redhat.com>
> Date:   Tue Oct 9 10:06:26 2018 +0800
> 
>     virtio_net: ethtool tx napi configuration
> 
>     Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
>     Interrupt moderation is currently not supported, so these accept and
>     display the default settings of 0 usec and 1 frame.
> 
>     Toggle tx napi through setting tx-frames. So as to not interfere
>     with possible future interrupt moderation, value 1 means tx napi while
>     value 0 means not.
> 
>     Only allow the switching when device is down for simplicity.
> 
>     Link: https://patchwork.ozlabs.org/patch/948149/
>     Suggested-by: Jason Wang <jasowang@redhat.com>
>     Signed-off-by: Willem de Bruijn <willemb@google.com>
>     Signed-off-by: Jason Wang <jasowang@redhat.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> Thanks


It's disabled when device is up - isn't that enough?

-- 
MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH RFC 1/2] virtio-net: bql support
From: Michael S. Tsirkin @ 2019-01-02 13:59 UTC (permalink / raw)
  To: Jason Wang
  Cc: netdev, linux-kernel, virtualization, maxime.coquelin, wexu,
	David S. Miller
In-Reply-To: <b4f06d11-4761-dabb-f641-5fc05c1c34fc@redhat.com>

On Wed, Jan 02, 2019 at 11:28:43AM +0800, Jason Wang wrote:
> 
> On 2018/12/31 上午2:45, Michael S. Tsirkin wrote:
> > On Thu, Dec 27, 2018 at 06:00:36PM +0800, Jason Wang wrote:
> > > On 2018/12/26 下午11:19, Michael S. Tsirkin wrote:
> > > > On Thu, Dec 06, 2018 at 04:17:36PM +0800, Jason Wang wrote:
> > > > > On 2018/12/6 上午6:54, Michael S. Tsirkin wrote:
> > > > > > When use_napi is set, let's enable BQLs.  Note: some of the issues are
> > > > > > similar to wifi.  It's worth considering whether something similar to
> > > > > > commit 36148c2bbfbe ("mac80211: Adjust TSQ pacing shift") might be
> > > > > > benefitial.
> > > > > I've played a similar patch several days before. The tricky part is the mode
> > > > > switching between napi and no napi. We should make sure when the packet is
> > > > > sent and trakced by BQL,  it should be consumed by BQL as well. I did it by
> > > > > tracking it through skb->cb.  And deal with the freeze by reset the BQL
> > > > > status. Patch attached.
> > > > > 
> > > > > But when testing with vhost-net, I don't very a stable performance,
> > > > So how about increasing TSQ pacing shift then?
> > > 
> > > I can test this. But changing default TCP value is much more than a
> > > virtio-net specific thing.
> > Well same logic as wifi applies. Unpredictable latencies related
> > to radio in one case, to host scheduler in the other.
> > 
> > > > > it was
> > > > > probably because we batch the used ring updating so tx interrupt may come
> > > > > randomly. We probably need to implement time bounded coalescing mechanism
> > > > > which could be configured from userspace.
> > > > I don't think it's reasonable to expect userspace to be that smart ...
> > > > Why do we need time bounded? used ring is always updated when ring
> > > > becomes empty.
> > > 
> > > We don't add used when means BQL may not see the consumed packet in time.
> > > And the delay varies based on the workload since we count packets not bytes
> > > or time before doing the batched updating.
> > > 
> > > Thanks
> > Sorry I still don't get it.
> > When nothing is outstanding then we do update the used.
> > So if BQL stops userspace from sending packets then
> > we get an interrupt and packets start flowing again.
> 
> 
> Yes, but how about the cases of multiple flows. That's where I see unstable
> results.
> 
> 
> > 
> > It might be suboptimal, we might need to tune it but I doubt running
> > timers is a solution, timer interrupts cause VM exits.
> 
> 
> Probably not a timer but a time counter (or event byte counter) in vhost to
> add used and signal guest if it exceeds a value instead of waiting the
> number of packets.
> 
> 
> Thanks

Well we already have VHOST_NET_WEIGHT - is it too big then?

And maybe we should expose the "MORE" flag in the descriptor -
do you think that will help?



> 
> > 
> > > > > Btw, maybe it's time just enable napi TX by default. I get ~10% TCP_RR
> > > > > regression on machine without APICv, (haven't found time to test APICv
> > > > > machine). But consider it was for correctness, I think it's acceptable? Then
> > > > > we can do optimization on top?
> > > > > 
> > > > > 
> > > > > Thanks
> > > > > 
> > > > > 
> > > > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > > ---
> > > > > >     drivers/net/virtio_net.c | 27 +++++++++++++++++++--------
> > > > > >     1 file changed, 19 insertions(+), 8 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > > > index cecfd77c9f3c..b657bde6b94b 100644
> > > > > > --- a/drivers/net/virtio_net.c
> > > > > > +++ b/drivers/net/virtio_net.c
> > > > > > @@ -1325,7 +1325,8 @@ static int virtnet_receive(struct receive_queue *rq, int budget,
> > > > > >     	return stats.packets;
> > > > > >     }
> > > > > > -static void free_old_xmit_skbs(struct send_queue *sq)
> > > > > > +static void free_old_xmit_skbs(struct send_queue *sq, struct netdev_queue *txq,
> > > > > > +			       bool use_napi)
> > > > > >     {
> > > > > >     	struct sk_buff *skb;
> > > > > >     	unsigned int len;
> > > > > > @@ -1347,6 +1348,9 @@ static void free_old_xmit_skbs(struct send_queue *sq)
> > > > > >     	if (!packets)
> > > > > >     		return;
> > > > > > +	if (use_napi)
> > > > > > +		netdev_tx_completed_queue(txq, packets, bytes);
> > > > > > +
> > > > > >     	u64_stats_update_begin(&sq->stats.syncp);
> > > > > >     	sq->stats.bytes += bytes;
> > > > > >     	sq->stats.packets += packets;
> > > > > > @@ -1364,7 +1368,7 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
> > > > > >     		return;
> > > > > >     	if (__netif_tx_trylock(txq)) {
> > > > > > -		free_old_xmit_skbs(sq);
> > > > > > +		free_old_xmit_skbs(sq, txq, true);
> > > > > >     		__netif_tx_unlock(txq);
> > > > > >     	}
> > > > > > @@ -1440,7 +1444,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
> > > > > >     	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
> > > > > >     	__netif_tx_lock(txq, raw_smp_processor_id());
> > > > > > -	free_old_xmit_skbs(sq);
> > > > > > +	free_old_xmit_skbs(sq, txq, true);
> > > > > >     	__netif_tx_unlock(txq);
> > > > > >     	virtqueue_napi_complete(napi, sq->vq, 0);
> > > > > > @@ -1505,13 +1509,15 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> > > > > >     	struct send_queue *sq = &vi->sq[qnum];
> > > > > >     	int err;
> > > > > >     	struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
> > > > > > -	bool kick = !skb->xmit_more;
> > > > > > +	bool more = skb->xmit_more;
> > > > > >     	bool use_napi = sq->napi.weight;
> > > > > > +	unsigned int bytes = skb->len;
> > > > > > +	bool kick;
> > > > > >     	/* Free up any pending old buffers before queueing new ones. */
> > > > > > -	free_old_xmit_skbs(sq);
> > > > > > +	free_old_xmit_skbs(sq, txq, use_napi);
> > > > > > -	if (use_napi && kick)
> > > > > > +	if (use_napi && !more)
> > > > > >     		virtqueue_enable_cb_delayed(sq->vq);
> > > > > >     	/* timestamp packet in software */
> > > > > > @@ -1552,7 +1558,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> > > > > >     		if (!use_napi &&
> > > > > >     		    unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
> > > > > >     			/* More just got used, free them then recheck. */
> > > > > > -			free_old_xmit_skbs(sq);
> > > > > > +			free_old_xmit_skbs(sq, txq, false);
> > > > > >     			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
> > > > > >     				netif_start_subqueue(dev, qnum);
> > > > > >     				virtqueue_disable_cb(sq->vq);
> > > > > > @@ -1560,7 +1566,12 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> > > > > >     		}
> > > > > >     	}
> > > > > > -	if (kick || netif_xmit_stopped(txq)) {
> > > > > > +	if (use_napi)
> > > > > > +		kick = __netdev_tx_sent_queue(txq, bytes, more);
> > > > > > +	else
> > > > > > +		kick = !more || netif_xmit_stopped(txq);
> > > > > > +
> > > > > > +	if (kick) {
> > > > > >     		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq)) {
> > > > > >     			u64_stats_update_begin(&sq->stats.syncp);
> > > > > >     			sq->stats.kicks++;
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [virtio-dev] RE: [PATCH v1 0/2] Virtio: fix some vq allocation issues
From: Cornelia Huck @ 2019-01-02 18:02 UTC (permalink / raw)
  To: Halil Pasic
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org,
	mst@redhat.com, linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, pbonzini@redhat.com,
	dgilbert@redhat.com
In-Reply-To: <20190102165919.0cd9365a@oc2783563651>

On Wed, 2 Jan 2019 16:59:19 +0100
Halil Pasic <pasic@linux.ibm.com> wrote:

> On Wed, 2 Jan 2019 14:23:38 +0100
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
> > On Wed, 2 Jan 2019 10:53:14 +0100
> > Cornelia Huck <cohuck@redhat.com> wrote:
> >   
> > > On Tue, 1 Jan 2019 00:40:19 +0100
> > > Halil Pasic <pasic@linux.ibm.com> wrote:  

> > > > AFAICT tweaking the balloon code may be simpler than tweaking the
> > > > virtio-ccw (transport code). ccw_io_helper() relies on getting
> > > > an interrupt when the issued IO is done. If virtio-ccw is buggy, it
> > > > needs to be fixed, but I'm not sure it is.    
> > > 
> > > I would not call virtio-ccw buggy, but it has some constraints that
> > > virtio-pci apparently doesn't have (and which did not show up so far;
> > > e.g. virtio-blk schedules a work item on config change, so there's no
> > > deadlock there.)
> > > 
> > > One way to get out of that constraint (don't interact with the config
> > > space directly in the config changed handler) would be to schedule a
> > > work item in virtio-ccw that calls virtio_config_changed() for the
> > > device. My understanding is that delaying the notification to a work
> > > queue would be fine.  
> > 
> > Unfortunately, calling virtio_config_changed() from a work item is not
> > enough: That function takes the config_lock, and the virtio-ccw code to
> > get the config both needs to allocate some memory and call schedule :/
> > 
> > The best option really seems to be
> > - have virtio-balloon move the handling of the config change onto a
> >   workqueue or something like that, and
> > - document that you cannot read/write the virtio config space from an
> >   atomic context
> > 
> > Unless someone has a better idea?
> >   
> 
> I wonder, would making config_lock a mutex suffice?

Unless I'm mistaken, you can't take a mutex in an interrupt path.

> I've already hinted that a virtio-balloon side fix is probably the
> simpler variant. 

Yes, I think so as well.

> I agree, let's fix the regression first, and think about wether to teach
> virtio-ccw to allow config manipulation from virtio_config_changed() or
> not later.

Or whether we can tweak the virtio code instead. But I agree, let's get
things working again first.

^ permalink raw reply

* Re: [RFC PATCH 1/1] s390/virtio: handle find on invalid queue gracefully
From: Cornelia Huck @ 2019-01-02 18:25 UTC (permalink / raw)
  To: Halil Pasic; +Cc: linux-s390, kvm, virtualization
In-Reply-To: <20190102175020.45251-1-pasic@linux.ibm.com>

On Wed,  2 Jan 2019 18:50:20 +0100
Halil Pasic <pasic@linux.ibm.com> wrote:

> A queue with a capacity of zero is clearly not a valid virtio queue.
> Some emulators report zero queue size if queried with an invalid queue
> index. Instead of crashing in this case let us just return -EINVAL. To
> make that work properly, let us fix the notifier cleanup logic as well.
> 
> Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
> ---
> 
> This patch is motivated by commit 86a5597 "virtio-balloon:
> VIRTIO_BALLOON_F_FREE_PAGE_HINT" (Wei Wang, 2018-08-27) which triggered
> the described scenario.  The emulator in question is the current QEMU.
> The problem we run into is the underflow in the following loop
> in  __vring_new_virtqueue():
> for (i = 0; i < vring.num-1; i++)
> 	vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1)
> Namely vring.num is an unsigned int.
> 
> RFC because I'm not sure about -EINVAL being a good choice, and about
> us caring about what happens if a virtio driver misbehaves like described.

For virtio-pci, the spec says that a zero queue size means that the
queue is unavailable. I don't think we have specified that explicitly
for virtio-ccw, but it does make sense.

virtio-pci returns -ENOENT in that case, which might be a good choice
here as well.

> 
> ---
>  drivers/s390/virtio/virtio_ccw.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
> index fc9dbad476c0..147927ed4fca 100644
> --- a/drivers/s390/virtio/virtio_ccw.c
> +++ b/drivers/s390/virtio/virtio_ccw.c
> @@ -272,6 +272,8 @@ static void virtio_ccw_drop_indicators(struct virtio_ccw_device *vcdev)
>  {
>  	struct virtio_ccw_vq_info *info;
>  
> +	if (!vcdev->airq_info)
> +		return;

Which case is this guarding against? names[i] was NULL for every index?

>  	list_for_each_entry(info, &vcdev->virtqueues, node)
>  		drop_airq_indicator(info->vq, vcdev->airq_info);
>  }
> @@ -514,6 +516,10 @@ static struct virtqueue *virtio_ccw_setup_vq(struct virtio_device *vdev,
>  		err = info->num;
>  		goto out_err;
>  	}
> +	if (info->num == 0) {
> +		err = -EINVAL;
> +		goto out_err;
> +	}
>  	size = PAGE_ALIGN(vring_size(info->num, KVM_VIRTIO_CCW_RING_ALIGN));
>  	info->queue = alloc_pages_exact(size, GFP_KERNEL | __GFP_ZERO);
>  	if (info->queue == NULL) {

^ permalink raw reply

* Re: [RFC PATCH V3 0/5] Hi:
From: Michael S. Tsirkin @ 2019-01-02 20:47 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, davem, linux-kernel, kvm, virtualization
In-Reply-To: <20181229124656.3900-1-jasowang@redhat.com>

On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> This series tries to access virtqueue metadata through kernel virtual
> address instead of copy_user() friends since they had too much
> overheads like checks, spec barriers or even hardware feature
> toggling.

Will review, thanks!
One questions that comes to mind is whether it's all about bypassing
stac/clac.  Could you please include a performance comparison with
nosmap?

> 
> Test shows about 24% improvement on TX PPS. It should benefit other
> cases as well.
>
> Changes from V2:
> - fix buggy range overlapping check
> - tear down MMU notifier during vhost ioctl to make sure invalidation
>   request can read metadata userspace address and vq size without
>   holding vq mutex.
> Changes from V1:
> - instead of pinning pages, use MMU notifier to invalidate vmaps and
>   remap duing metadata prefetch
> - fix build warning on MIPS
> 
> Jason Wang (5):
>   vhost: generalize adding used elem
>   vhost: fine grain userspace memory accessors
>   vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch()
>   vhost: introduce helpers to get the size of metadata area
>   vhost: access vq metadata through kernel virtual address
> 
>  drivers/vhost/net.c   |   4 +-
>  drivers/vhost/vhost.c | 416 +++++++++++++++++++++++++++++++++++++-----
>  drivers/vhost/vhost.h |  15 +-
>  3 files changed, 384 insertions(+), 51 deletions(-)
> 
> -- 
> 2.17.1

^ permalink raw reply

* [PATCH RFC 0/4] barriers using data dependency
From: Michael S. Tsirkin @ 2019-01-02 20:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrea Parri, linux-arch, Paul E. McKenney, Peter Zijlstra,
	Daniel Lustig, Akira Yokosawa, Will Deacon, Nicholas Piggin,
	virtualization, David Howells, Alan Stern, netdev, Luc Maranget,
	Jade Alglave, Boqun Feng

So as explained in Documentation/memory-barriers.txt e.g.
a load followed by a store require a full memory barrier,
to avoid store being ordered before the load.
Similarly load-load requires a read memory barrier.

Thinking about it, we can actually create a data dependency
by mixing the first loaded value into the pointer being
accessed.

This adds an API for this and uses it in virtio.

Written over the holiday and build tested only so far.

This patchset is also suboptimal on e.g. x86 where e.g. smp_rmb is a nop.

Sending out for early feedback/flames.

Michael S. Tsirkin (4):
  include/linux/compiler*.h: fix OPTIMIZER_HIDE_VAR
  include/linux/compiler.h: allow memory operands
  barriers: convert a control to a data dependency
  virtio: use dependent_ptr_mb

 Documentation/memory-barriers.txt | 20 ++++++++++++++++++++
 arch/alpha/include/asm/barrier.h  |  1 +
 drivers/virtio/virtio_ring.c      |  6 ++++--
 include/asm-generic/barrier.h     | 18 ++++++++++++++++++
 include/linux/compiler-clang.h    |  5 ++---
 include/linux/compiler-gcc.h      |  4 ----
 include/linux/compiler-intel.h    |  4 +---
 include/linux/compiler.h          |  8 +++++++-
 8 files changed, 53 insertions(+), 13 deletions(-)

-- 
MST

^ permalink raw reply

* [PATCH RFC 1/4] include/linux/compiler*.h: fix OPTIMIZER_HIDE_VAR
From: Michael S. Tsirkin @ 2019-01-02 20:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrea Parri, Peter Zijlstra, Akira Yokosawa, Will Deacon,
	virtualization, David Howells, linux-arch, linux-sparse,
	Alan Stern, Paul E. McKenney, Boqun Feng, Daniel Lustig,
	Nicholas Piggin, Luc Maranget, Eli Friedman, Jade Alglave, netdev,
	Nick Desaulniers, Joe Perches, Linus Torvalds, Luc Van Oostenryck
In-Reply-To: <20190102205715.14054-1-mst@redhat.com>

Since commit 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h
mutually exclusive") clang no longer reuses the OPTIMIZER_HIDE_VAR macro
from compiler-gcc - instead it gets the version in
include/linux/compiler.h.  Unfortunately that version doesn't actually
prevent compiler from optimizing out the variable.

Fix up by moving the macro out from compiler-gcc.h to compiler.h.
Compilers without incline asm support will keep working
since it's protected by an ifdef.

Also fix up comments to match reality since we are no longer overriding
any macros.

Build-tested with gcc and clang.

Fixes: 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h mutually exclusive")
Cc: Eli Friedman <efriedma@codeaurora.org>
Cc: Joe Perches <joe@perches.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/compiler-clang.h | 5 ++---
 include/linux/compiler-gcc.h   | 4 ----
 include/linux/compiler-intel.h | 4 +---
 include/linux/compiler.h       | 4 +++-
 4 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h
index 3e7dafb3ea80..7ddaeb5182e3 100644
--- a/include/linux/compiler-clang.h
+++ b/include/linux/compiler-clang.h
@@ -3,9 +3,8 @@
 #error "Please don't include <linux/compiler-clang.h> directly, include <linux/compiler.h> instead."
 #endif
 
-/* Some compiler specific definitions are overwritten here
- * for Clang compiler
- */
+/* Compiler specific definitions for Clang compiler */
+
 #define uninitialized_var(x) x = *(&(x))
 
 /* same as gcc, this was present in clang-2.6 so we can assume it works
diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index 2010493e1040..72054d9f0eaa 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -58,10 +58,6 @@
 	(typeof(ptr)) (__ptr + (off));					\
 })
 
-/* Make the optimizer believe the variable can be manipulated arbitrarily. */
-#define OPTIMIZER_HIDE_VAR(var)						\
-	__asm__ ("" : "=r" (var) : "0" (var))
-
 /*
  * A trick to suppress uninitialized variable warning without generating any
  * code
diff --git a/include/linux/compiler-intel.h b/include/linux/compiler-intel.h
index 517bd14e1222..b17f3cd18334 100644
--- a/include/linux/compiler-intel.h
+++ b/include/linux/compiler-intel.h
@@ -5,9 +5,7 @@
 
 #ifdef __ECC
 
-/* Some compiler specific definitions are overwritten here
- * for Intel ECC compiler
- */
+/* Compiler specific definitions for Intel ECC compiler */
 
 #include <asm/intrinsics.h>
 
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 06396c1cf127..1ad367b4cd8d 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -152,7 +152,9 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
 #endif
 
 #ifndef OPTIMIZER_HIDE_VAR
-#define OPTIMIZER_HIDE_VAR(var) barrier()
+/* Make the optimizer believe the variable can be manipulated arbitrarily. */
+#define OPTIMIZER_HIDE_VAR(var)						\
+	__asm__ ("" : "=r" (var) : "0" (var))
 #endif
 
 /* Not-quite-unique ID. */
-- 
MST

^ permalink raw reply related

* [PATCH RFC 2/4] include/linux/compiler.h: allow memory operands
From: Michael S. Tsirkin @ 2019-01-02 20:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrea Parri, linux-arch, Paul E. McKenney, Peter Zijlstra,
	Daniel Lustig, Akira Yokosawa, Will Deacon, Nicholas Piggin,
	virtualization, David Howells, linux-sparse, Alan Stern, netdev,
	Luc Maranget, Jade Alglave, Boqun Feng, Luc Van Oostenryck
In-Reply-To: <20190102205715.14054-1-mst@redhat.com>

We don't really care whether the variable is in-register
or in-memory. Relax the constraint accordingly.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/compiler.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 1ad367b4cd8d..6601d39e8c48 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -154,7 +154,7 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
 #ifndef OPTIMIZER_HIDE_VAR
 /* Make the optimizer believe the variable can be manipulated arbitrarily. */
 #define OPTIMIZER_HIDE_VAR(var)						\
-	__asm__ ("" : "=r" (var) : "0" (var))
+	__asm__ ("" : "=rm" (var) : "0" (var))
 #endif
 
 /* Not-quite-unique ID. */
-- 
MST

^ permalink raw reply related

* [PATCH RFC 3/4] barriers: convert a control to a data dependency
From: Michael S. Tsirkin @ 2019-01-02 20:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrea Parri, linux-doc, Peter Zijlstra, Akira Yokosawa,
	Will Deacon, virtualization, David Howells, linux-arch,
	Jonathan Corbet, linux-sparse, Alan Stern, Matt Turner,
	Paul E. McKenney, Boqun Feng, Arnd Bergmann, Daniel Lustig,
	Nicholas Piggin, Ivan Kokshaysky, Luc Maranget, Richard Henderson,
	Jade Alglave, netdev, linux-alpha, Luc Van Oostenryck
In-Reply-To: <20190102205715.14054-1-mst@redhat.com>

It's not uncommon to have two access two unrelated memory locations in a
specific order.  At the moment one has to use a memory barrier for this.

However, if the first access was a read and the second used an address
depending on the first one we would have a data dependency and no
barrier would be necessary.

This adds a new interface: dependent_ptr_mb which does exactly this: it
returns a pointer with a data dependency on the supplied value.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 Documentation/memory-barriers.txt | 20 ++++++++++++++++++++
 arch/alpha/include/asm/barrier.h  |  1 +
 include/asm-generic/barrier.h     | 18 ++++++++++++++++++
 include/linux/compiler.h          |  4 ++++
 4 files changed, 43 insertions(+)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index c1d913944ad8..9dbaa2e1dbf6 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -691,6 +691,18 @@ case what's actually required is:
 		p = READ_ONCE(b);
 	}
 
+Alternatively, a control dependency can be converted to a data dependency,
+e.g.:
+
+	q = READ_ONCE(a);
+	if (q) {
+		b = dependent_ptr_mb(b, q);
+		p = READ_ONCE(b);
+	}
+
+Note how the result of dependent_ptr_mb must be used with the following
+accesses in order to have an effect.
+
 However, stores are not speculated.  This means that ordering -is- provided
 for load-store control dependencies, as in the following example:
 
@@ -836,6 +848,12 @@ out-guess your code.  More generally, although READ_ONCE() does force
 the compiler to actually emit code for a given load, it does not force
 the compiler to use the results.
 
+Converting to a data dependency helps with this too:
+
+	q = READ_ONCE(a);
+	b = dependent_ptr_mb(b, q);
+	WRITE_ONCE(b, 1);
+
 In addition, control dependencies apply only to the then-clause and
 else-clause of the if-statement in question.  In particular, it does
 not necessarily apply to code following the if-statement:
@@ -875,6 +893,8 @@ to the CPU containing it.  See the section on "Multicopy atomicity"
 for more information.
 
 
+
+
 In summary:
 
   (*) Control dependencies can order prior loads against later stores.
diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
index 92ec486a4f9e..b4934e8c551b 100644
--- a/arch/alpha/include/asm/barrier.h
+++ b/arch/alpha/include/asm/barrier.h
@@ -59,6 +59,7 @@
  * as Alpha, "y" could be set to 3 and "x" to 0.  Use rmb()
  * in cases like this where there are no data dependencies.
  */
+#define ARCH_NEEDS_READ_BARRIER_DEPENDS 1
 #define read_barrier_depends() __asm__ __volatile__("mb": : :"memory")
 
 #ifdef CONFIG_SMP
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 2cafdbb9ae4c..fa2e2ef72b68 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -70,6 +70,24 @@
 #define __smp_read_barrier_depends()	read_barrier_depends()
 #endif
 
+#if defined(COMPILER_HAS_OPTIMIZER_HIDE_VAR) && \
+	!defined(ARCH_NEEDS_READ_BARRIER_DEPENDS)
+
+#define dependent_ptr_mb(ptr, val) ({					\
+	long dependent_ptr_mb_val = (long)(val);			\
+	long dependent_ptr_mb_ptr = (long)(ptr) - dependent_ptr_mb_val;	\
+									\
+	BUILD_BUG_ON(sizeof(val) > sizeof(long));			\
+	OPTIMIZER_HIDE_VAR(dependent_ptr_mb_val);			\
+	(typeof(ptr))(dependent_ptr_mb_ptr + dependent_ptr_mb_val);	\
+})
+
+#else
+
+#define dependent_ptr_mb(ptr, val) ({ mb(); (ptr); })
+
+#endif
+
 #ifdef CONFIG_SMP
 
 #ifndef smp_mb
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 6601d39e8c48..f599c30f1b28 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -152,9 +152,13 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
 #endif
 
 #ifndef OPTIMIZER_HIDE_VAR
+
 /* Make the optimizer believe the variable can be manipulated arbitrarily. */
 #define OPTIMIZER_HIDE_VAR(var)						\
 	__asm__ ("" : "=rm" (var) : "0" (var))
+
+#define COMPILER_HAS_OPTIMIZER_HIDE_VAR 1
+
 #endif
 
 /* Not-quite-unique ID. */
-- 
MST

^ permalink raw reply related

* [PATCH RFC 4/4] virtio: use dependent_ptr_mb
From: Michael S. Tsirkin @ 2019-01-02 20:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrea Parri, linux-arch, Paul E. McKenney, Peter Zijlstra,
	Daniel Lustig, Akira Yokosawa, Will Deacon, Nicholas Piggin,
	virtualization, David Howells, Alan Stern, netdev, Luc Maranget,
	Jade Alglave, Boqun Feng
In-Reply-To: <20190102205715.14054-1-mst@redhat.com>

Use dependent_ptr_mb which is - on some architectures -
more light-weight than an rmb.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_ring.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 814b395007b2..2d320396eff8 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -702,6 +702,7 @@ void *virtqueue_get_buf_ctx(struct virtqueue *_vq, unsigned int *len,
 	void *ret;
 	unsigned int i;
 	u16 last_used;
+	bool more;
 
 	START_USE(vq);
 
@@ -710,14 +711,15 @@ void *virtqueue_get_buf_ctx(struct virtqueue *_vq, unsigned int *len,
 		return NULL;
 	}
 
-	if (!more_used(vq)) {
+	more = more_used(vq);
+	if (!more) {
 		pr_debug("No more buffers in queue\n");
 		END_USE(vq);
 		return NULL;
 	}
 
 	/* Only get used array entries after they have been exposed by host. */
-	virtio_rmb(vq->weak_barriers);
+	vq = dependent_ptr_mb(vq, more);
 
 	last_used = (vq->last_used_idx & (vq->vring.num - 1));
 	i = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].id);
-- 
MST

^ permalink raw reply related

* Re: [PATCH RFC 3/4] barriers: convert a control to a data dependency
From: Matthew Wilcox @ 2019-01-02 21:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Andrea Parri, linux-doc, Peter Zijlstra, Akira Yokosawa,
	Will Deacon, virtualization, David Howells, linux-arch,
	Jonathan Corbet, linux-sparse, Alan Stern, Matt Turner,
	Paul E. McKenney, Boqun Feng, Arnd Bergmann, Daniel Lustig,
	Nicholas Piggin, Ivan Kokshaysky, Luc Maranget, Richard Henderson,
	Jade Alglave, netdev, linux-kernel, linux-alpha
In-Reply-To: <20190102205715.14054-4-mst@redhat.com>

On Wed, Jan 02, 2019 at 03:57:58PM -0500, Michael S. Tsirkin wrote:
> @@ -875,6 +893,8 @@ to the CPU containing it.  See the section on "Multicopy atomicity"
>  for more information.
>  
>  
> +
> +
>  In summary:
>  
>    (*) Control dependencies can order prior loads against later stores.

Was this hunk intentional?

^ permalink raw reply

* [PULL] virtio, vhost: features, fixes, cleanups
From: Michael S. Tsirkin @ 2019-01-02 21:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kvm, mst, netdev, linux-kernel, virtualization, stefanha,
	pbonzini, changpeng.liu, wangyan122

The following changes since commit 7566ec393f4161572ba6f11ad5171fd5d59b0fbd:

  Linux 4.20-rc7 (2018-12-16 15:46:55 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to a691ffb46edd7cb12a17ff0965ab59dbc95f48de:

  vhost: correct the related warning message (2018-12-19 18:23:50 -0500)

----------------------------------------------------------------
virtio, vhost: features, fixes, cleanups

discard in virtio blk
misc fixes and cleanups

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

----------------------------------------------------------------
Changpeng Liu (1):
      virtio_blk: add discard and write zeroes support

Dongli Zhang (1):
      virtio: remove deprecated VIRTIO_PCI_CONFIG()

Michael S. Tsirkin (1):
      virtio: fix test build after uio.h change

Paolo Bonzini (1):
      vhost: split structs into a separate header file

Stefan Hajnoczi (1):
      vhost/vsock: switch to a mutex for vhost_vsock_hash

wangyan (1):
      vhost: correct the related warning message

 drivers/block/virtio_blk.c         |  83 +++++++++++++++++++++++-
 drivers/vhost/scsi.c               |   4 +-
 drivers/vhost/vsock.c              |  16 ++---
 drivers/virtio/virtio_pci_legacy.c |   6 +-
 include/uapi/linux/vhost.h         | 113 +-------------------------------
 include/uapi/linux/vhost_types.h   | 128 +++++++++++++++++++++++++++++++++++++
 include/uapi/linux/virtio_blk.h    |  54 ++++++++++++++++
 tools/virtio/linux/kernel.h        |   4 ++
 8 files changed, 283 insertions(+), 125 deletions(-)
 create mode 100644 include/uapi/linux/vhost_types.h

^ permalink raw reply

* Re: [PATCH RFC 3/4] barriers: convert a control to a data dependency
From: Michael S. Tsirkin @ 2019-01-02 21:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrea Parri, linux-doc, Peter Zijlstra, Akira Yokosawa,
	Will Deacon, virtualization, David Howells, linux-arch,
	Jonathan Corbet, linux-sparse, Alan Stern, Matt Turner,
	Paul E. McKenney, Boqun Feng, Arnd Bergmann, Daniel Lustig,
	Nicholas Piggin, Ivan Kokshaysky, Luc Maranget, Richard Henderson,
	Jade Alglave, netdev, linux-kernel, linux-alpha
In-Reply-To: <20190102210024.GJ6310@bombadil.infradead.org>

On Wed, Jan 02, 2019 at 01:00:24PM -0800, Matthew Wilcox wrote:
> On Wed, Jan 02, 2019 at 03:57:58PM -0500, Michael S. Tsirkin wrote:
> > @@ -875,6 +893,8 @@ to the CPU containing it.  See the section on "Multicopy atomicity"
> >  for more information.
> >  
> >  
> > +
> > +
> >  In summary:
> >  
> >    (*) Control dependencies can order prior loads against later stores.
> 
> Was this hunk intentional?

Nope, thanks for catching this.

-- 
MST

^ permalink raw reply

* Re: [PATCH RFC 0/4] barriers using data dependency
From: Michael S. Tsirkin @ 2019-01-02 23:04 UTC (permalink / raw)
  To: Alan Stern
  Cc: Andrea Parri, linux-arch, Paul E. McKenney, Peter Zijlstra,
	Daniel Lustig, Akira Yokosawa, Will Deacon, linux-kernel,
	Nicholas Piggin, virtualization, David Howells, netdev,
	Luc Maranget, Jade Alglave, Boqun Feng
In-Reply-To: <Pine.LNX.4.44L0.1901021629150.1375-100000@iolanthe.rowland.org>

On Wed, Jan 02, 2019 at 04:36:40PM -0500, Alan Stern wrote:
> On Wed, 2 Jan 2019, Michael S. Tsirkin wrote:
> 
> > So as explained in Documentation/memory-barriers.txt e.g.
> > a load followed by a store require a full memory barrier,
> > to avoid store being ordered before the load.
> > Similarly load-load requires a read memory barrier.
> > 
> > Thinking about it, we can actually create a data dependency
> > by mixing the first loaded value into the pointer being
> > accessed.
> > 
> > This adds an API for this and uses it in virtio.
> > 
> > Written over the holiday and build tested only so far.
> 
> You are using the terminology from memory-barriers.txt, referring to
> the new dependency you create as a data dependency.  However,
> tools/memory-model/* uses a more precise name, calling it an address
> dependency.  Could you change the comments in the patches to use this
> name instead?

Sure, sounds good. While I'm at it, should memory-barriers.txt be
switched over too?

> > This patchset is also suboptimal on e.g. x86 where e.g. smp_rmb is a nop.
> 
> This should be easy to fix with an architecture-specific override.
> 
> Alan Stern

Absolutely. It does however mean that we'll need several
variants: mb/rmb, smp/dma/virt/mandatory.

I am still trying to decide whether it's good since it documents the
kind of barrier that we are trying to use - or bad since it's more
verbose and makes you choose one where they are all pretty cheap.

> > Sending out for early feedback/flames.
> > 
> > Michael S. Tsirkin (4):
> >   include/linux/compiler*.h: fix OPTIMIZER_HIDE_VAR
> >   include/linux/compiler.h: allow memory operands
> >   barriers: convert a control to a data dependency
> >   virtio: use dependent_ptr_mb
> > 
> >  Documentation/memory-barriers.txt | 20 ++++++++++++++++++++
> >  arch/alpha/include/asm/barrier.h  |  1 +
> >  drivers/virtio/virtio_ring.c      |  6 ++++--
> >  include/asm-generic/barrier.h     | 18 ++++++++++++++++++
> >  include/linux/compiler-clang.h    |  5 ++---
> >  include/linux/compiler-gcc.h      |  4 ----
> >  include/linux/compiler-intel.h    |  4 +---
> >  include/linux/compiler.h          |  8 +++++++-
> >  8 files changed, 53 insertions(+), 13 deletions(-)

^ permalink raw reply

* [PATCH v1 0/2] virtio-balloon: tweak config_changed
From: Wei Wang @ 2019-01-03  5:31 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, virtualization, kvm, mst, cohuck
  Cc: pasic, dgilbert, pbonzini

Since virtio-ccw doesn't work with accessing to the config registers
inside an interrupt context, this patch series avoids that issue by
moving the config register accesses to the related workqueue contexts.

Wei Wang (2):
  virtio-balloon: tweak config_changed implementation
  virtio-balloon: improve update_balloon_size_func

 drivers/virtio/virtio_balloon.c | 59 +++++++++++++++++++++--------------------
 1 file changed, 30 insertions(+), 29 deletions(-)

-- 
2.7.4

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox