Linux virtualization list
 help / color / mirror / Atom feed
* Re: [PATCH net-next 04/11] tuntap: simplify error handling in tun_build_skb()
From: Michael S. Tsirkin @ 2018-09-07 14:17 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <e1f17f51-5f99-df18-4185-6c6e4dfaba89@redhat.com>

On Fri, Sep 07, 2018 at 11:22:00AM +0800, Jason Wang wrote:
> > > @@ -1668,6 +1668,9 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
> > >   	if (copied != len)
> > >   		return ERR_PTR(-EFAULT);
> > > +	get_page(alloc_frag->page);
> > > +	alloc_frag->offset += buflen;
> > > +
> > This adds an atomic op on XDP_DROP which is a data path
> > operation for some workloads.
> 
> Yes, I have patch on top to amortize this, the idea is to have a very big
> refcount once after the frag was allocated and maintain a bias and decrease
> them all when allocating new frags.'

Why bother with refcounting for a drop though? It should be simple.

-- 
MST

^ permalink raw reply

* Re: [PATCH net-next 11/11] vhost_net: batch submitting XDP buffers to underlayer sockets
From: Michael S. Tsirkin @ 2018-09-07 16:13 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <ffb802e6-1505-e01e-f6d5-11cde8dace9b@redhat.com>

On Fri, Sep 07, 2018 at 03:41:52PM +0800, Jason Wang wrote:
> > > @@ -556,10 +667,14 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
> > >   	size_t len, total_len = 0;
> > >   	int err;
> > >   	int sent_pkts = 0;
> > > +	bool bulking = (sock->sk->sk_sndbuf == INT_MAX);
> > What does bulking mean?
> 
> The name is misleading, it means whether we can do batching. For simplicity,
> I disable batching is sndbuf is not INT_MAX.

But what does batching have to do with sndbuf?

> > >   	for (;;) {
> > >   		bool busyloop_intr = false;
> > > +		if (nvq->done_idx == VHOST_NET_BATCH)
> > > +			vhost_tx_batch(net, nvq, sock, &msg);
> > > +
> > >   		head = get_tx_bufs(net, nvq, &msg, &out, &in, &len,
> > >   				   &busyloop_intr);
> > >   		/* On error, stop handling until the next kick. */
> > > @@ -577,14 +692,34 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
> > >   			break;
> > >   		}
> > > -		vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head);
> > > -		vq->heads[nvq->done_idx].len = 0;
> > > -
> > >   		total_len += len;
> > > -		if (tx_can_batch(vq, total_len))
> > > -			msg.msg_flags |= MSG_MORE;
> > > -		else
> > > -			msg.msg_flags &= ~MSG_MORE;
> > > +
> > > +		/* For simplicity, TX batching is only enabled if
> > > +		 * sndbuf is unlimited.
> > What if sndbuf changes while this processing is going on?
> 
> We will get the correct sndbuf in the next run of handle_tx(). I think this
> is safe.

If it's safe why bother with special-casing INT_MAX?

-- 
MST

^ permalink raw reply

* [PATCH] virtio: add unlikely() to WARN_ON_ONCE()
From: Igor Stoppa @ 2018-09-07 17:48 UTC (permalink / raw)
  To: Michael S . Tsirkin; +Cc: linux-kernel, Igor Stoppa, Virtualization

The condition to test is unlikely() to be true. Add the hint.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Virtualization@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
---
 tools/virtio/linux/kernel.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/virtio/linux/kernel.h b/tools/virtio/linux/kernel.h
index fb22bccfbc8a..2b5e417ea2f0 100644
--- a/tools/virtio/linux/kernel.h
+++ b/tools/virtio/linux/kernel.h
@@ -123,7 +123,7 @@ static inline void free_page(unsigned long addr)
 #define dev_err(dev, format, ...) fprintf (stderr, format, ## __VA_ARGS__)
 #define dev_warn(dev, format, ...) fprintf (stderr, format, ## __VA_ARGS__)
 
-#define WARN_ON_ONCE(cond) ((cond) ? fprintf (stderr, "WARNING\n") : 0)
+#define WARN_ON_ONCE(cond) (unlikely(cond) ? fprintf (stderr, "WARNING\n") : 0)
 
 #define min(x, y) ({				\
 	typeof(x) _min1 = (x);			\
-- 
2.17.1

^ permalink raw reply related

* [PATCH v2] x86/paravirt: Get rid of patch_site and patch_default labels
From: Borislav Petkov @ 2018-09-07 20:10 UTC (permalink / raw)
  To: LKML; +Cc: Juergen Gross, x86, virtualization
In-Reply-To: <20180907105112.GD12849@zn.tnic>

On Fri, Sep 07, 2018 at 12:51:12PM +0200, Borislav Petkov wrote:
> Whoops, no, it is still being used. Lemme see if I can remove all the
> labels in that function.

Ok, I think I've gotten rid of them all, in both 32- and 64-bit
versions. It boots here and a bunch of all*config builds pass.

---
From: Borislav Petkov <bp@suse.de>
Date: Fri, 7 Sep 2018 12:47:10 +0200
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When CONFIG_PARAVIRT_SPINLOCKS=n, it fires

  arch/x86/kernel/paravirt_patch_64.c: In function ‘native_patch’:
  arch/x86/kernel/paravirt_patch_64.c:89:1: warning: label ‘patch_site’ defined but not used [-Wunused-label]
   patch_site:

but those labels can simply be removed by directly calling the
respective functions there.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: x86@kernel.org
Cc: virtualization@lists.linux-foundation.org
---
 arch/x86/kernel/paravirt_patch_32.c | 21 ++++++++++-----------
 arch/x86/kernel/paravirt_patch_64.c | 20 +++++++++-----------
 2 files changed, 19 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/paravirt_patch_32.c b/arch/x86/kernel/paravirt_patch_32.c
index d460cbcabcfe..af57f0d0789f 100644
--- a/arch/x86/kernel/paravirt_patch_32.c
+++ b/arch/x86/kernel/paravirt_patch_32.c
@@ -34,14 +34,16 @@ extern bool pv_is_native_vcpu_is_preempted(void);
 
 unsigned native_patch(u8 type, void *ibuf, unsigned long addr, unsigned len)
 {
-	const unsigned char *start, *end;
+	const unsigned char *start __maybe_unused, *end __maybe_unused;
 	unsigned ret;
 
 #define PATCH_SITE(ops, x)					\
 		case PARAVIRT_PATCH(ops.x):			\
 			start = start_##ops##_##x;		\
 			end = end_##ops##_##x;			\
-			goto patch_site
+								\
+			return paravirt_patch_insns(ibuf, len, start, end);
+
 	switch (type) {
 #ifdef CONFIG_PARAVIRT_XXL
 		PATCH_SITE(irq, irq_disable);
@@ -58,27 +60,24 @@ unsigned native_patch(u8 type, void *ibuf, unsigned long addr, unsigned len)
 			if (pv_is_native_spin_unlock()) {
 				start = start_lock_queued_spin_unlock;
 				end   = end_lock_queued_spin_unlock;
-				goto patch_site;
+
+				return paravirt_patch_insns(ibuf, len, start, end);
 			}
-			goto patch_default;
+			return paravirt_patch_default(type, ibuf, addr, len);
 
 		case PARAVIRT_PATCH(lock.vcpu_is_preempted):
 			if (pv_is_native_vcpu_is_preempted()) {
 				start = start_lock_vcpu_is_preempted;
 				end   = end_lock_vcpu_is_preempted;
-				goto patch_site;
+
+				return paravirt_patch_insns(ibuf, len, start, end);
 			}
-			goto patch_default;
+			return paravirt_patch_default(type, ibuf, addr, len);
 #endif
 
 	default:
-patch_default: __maybe_unused
 		ret = paravirt_patch_default(type, ibuf, addr, len);
 		break;
-
-patch_site:
-		ret = paravirt_patch_insns(ibuf, len, start, end);
-		break;
 	}
 #undef PATCH_SITE
 	return ret;
diff --git a/arch/x86/kernel/paravirt_patch_64.c b/arch/x86/kernel/paravirt_patch_64.c
index 5ad5bcda9dc6..2db6c615932f 100644
--- a/arch/x86/kernel/paravirt_patch_64.c
+++ b/arch/x86/kernel/paravirt_patch_64.c
@@ -42,14 +42,15 @@ extern bool pv_is_native_vcpu_is_preempted(void);
 
 unsigned native_patch(u8 type, void *ibuf, unsigned long addr, unsigned len)
 {
-	const unsigned char *start, *end;
+	const unsigned char *start __maybe_unused, *end __maybe_unused;
 	unsigned ret;
 
 #define PATCH_SITE(ops, x)					\
 		case PARAVIRT_PATCH(ops.x):			\
 			start = start_##ops##_##x;		\
 			end = end_##ops##_##x;			\
-			goto patch_site
+								\
+			return paravirt_patch_insns(ibuf, len, start, end);
 	switch(type) {
 #ifdef CONFIG_PARAVIRT_XXL
 		PATCH_SITE(irq, restore_fl);
@@ -68,27 +69,24 @@ unsigned native_patch(u8 type, void *ibuf, unsigned long addr, unsigned len)
 			if (pv_is_native_spin_unlock()) {
 				start = start_lock_queued_spin_unlock;
 				end   = end_lock_queued_spin_unlock;
-				goto patch_site;
+
+				return paravirt_patch_insns(ibuf, len, start, end);
 			}
-			goto patch_default;
+			return paravirt_patch_default(type, ibuf, addr, len);
 
 		case PARAVIRT_PATCH(lock.vcpu_is_preempted):
 			if (pv_is_native_vcpu_is_preempted()) {
 				start = start_lock_vcpu_is_preempted;
 				end   = end_lock_vcpu_is_preempted;
-				goto patch_site;
+
+				return paravirt_patch_insns(ibuf, len, start, end);
 			}
-			goto patch_default;
+			return paravirt_patch_default(type, ibuf, addr, len);
 #endif
 
 	default:
-patch_default: __maybe_unused
 		ret = paravirt_patch_default(type, ibuf, addr, len);
 		break;
-
-patch_site:
-		ret = paravirt_patch_insns(ibuf, len, start, end);
-		break;
 	}
 #undef PATCH_SITE
 	return ret;
-- 
2.17.0.582.gccdcbd54c

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related

* RE: [PATCH v36 0/5] Virtio-balloon: support free page reporting
From: Wang, Wei W @ 2018-09-08  1:46 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	Michael S. Tsirkin, nilal@redhat.com,
	liliang.opensource@gmail.com, linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, linux-mm@kvack.org,
	pbonzini@redhat.com, akpm@linux-foundation.org, mhocko@kernel.org,
	torvalds@linux-foundation.org
In-Reply-To: <20180907122955.GD2544@work-vm>

On Friday, September 7, 2018 8:30 PM, Dr. David Alan Gilbert wrote:
> OK, that's much better.
> The ~50% reducton with a 8G VM and a real workload is great, and it does
> what you expect when you put a lot more RAM in and see the 84% reduction
> on a guest with 128G RAM - 54s vs ~9s is a big win!
> 
> (The migrate_set_speed is a bit high, since that's in bytes/s - but it's not
> important).
> 
> That looks good,
> 

Thanks Dave for the feedback.
Hope you can join our discussion and review the v37 patches as well.

Best,
Wei

^ permalink raw reply

* Re: [PATCH v2] x86/paravirt: Get rid of patch_site and patch_default labels
From: Juergen Gross @ 2018-09-08  5:45 UTC (permalink / raw)
  To: Borislav Petkov, LKML; +Cc: x86, virtualization
In-Reply-To: <20180907201050.GF12849@zn.tnic>

On 07/09/18 22:10, Borislav Petkov wrote:
> On Fri, Sep 07, 2018 at 12:51:12PM +0200, Borislav Petkov wrote:
>> Whoops, no, it is still being used. Lemme see if I can remove all the
>> labels in that function.
> 
> Ok, I think I've gotten rid of them all, in both 32- and 64-bit
> versions. It boots here and a bunch of all*config builds pass.
> 
> ---
> From: Borislav Petkov <bp@suse.de>
> Date: Fri, 7 Sep 2018 12:47:10 +0200
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> When CONFIG_PARAVIRT_SPINLOCKS=n, it fires
> 
>   arch/x86/kernel/paravirt_patch_64.c: In function ‘native_patch’:
>   arch/x86/kernel/paravirt_patch_64.c:89:1: warning: label ‘patch_site’ defined but not used [-Wunused-label]
>    patch_site:
> 
> but those labels can simply be removed by directly calling the
> respective functions there.
> 
> Signed-off-by: Borislav Petkov <bp@suse.de>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: x86@kernel.org
> Cc: virtualization@lists.linux-foundation.org
> ---
>  arch/x86/kernel/paravirt_patch_32.c | 21 ++++++++++-----------
>  arch/x86/kernel/paravirt_patch_64.c | 20 +++++++++-----------
>  2 files changed, 19 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/kernel/paravirt_patch_32.c b/arch/x86/kernel/paravirt_patch_32.c
> index d460cbcabcfe..af57f0d0789f 100644
> --- a/arch/x86/kernel/paravirt_patch_32.c
> +++ b/arch/x86/kernel/paravirt_patch_32.c
> @@ -34,14 +34,16 @@ extern bool pv_is_native_vcpu_is_preempted(void);
>  
>  unsigned native_patch(u8 type, void *ibuf, unsigned long addr, unsigned len)
>  {
> -	const unsigned char *start, *end;
> +	const unsigned char *start __maybe_unused, *end __maybe_unused;
>  	unsigned ret;
>  
>  #define PATCH_SITE(ops, x)					\
>  		case PARAVIRT_PATCH(ops.x):			\
>  			start = start_##ops##_##x;		\
>  			end = end_##ops##_##x;			\
> -			goto patch_site
> +								\
> +			return paravirt_patch_insns(ibuf, len, start, end);
> +
>  	switch (type) {
>  #ifdef CONFIG_PARAVIRT_XXL
>  		PATCH_SITE(irq, irq_disable);
> @@ -58,27 +60,24 @@ unsigned native_patch(u8 type, void *ibuf, unsigned long addr, unsigned len)
>  			if (pv_is_native_spin_unlock()) {
>  				start = start_lock_queued_spin_unlock;
>  				end   = end_lock_queued_spin_unlock;
> -				goto patch_site;
> +
> +				return paravirt_patch_insns(ibuf, len, start, end);
>  			}
> -			goto patch_default;
> +			return paravirt_patch_default(type, ibuf, addr, len);
>  
>  		case PARAVIRT_PATCH(lock.vcpu_is_preempted):
>  			if (pv_is_native_vcpu_is_preempted()) {
>  				start = start_lock_vcpu_is_preempted;
>  				end   = end_lock_vcpu_is_preempted;
> -				goto patch_site;
> +
> +				return paravirt_patch_insns(ibuf, len, start, end);
>  			}
> -			goto patch_default;
> +			return paravirt_patch_default(type, ibuf, addr, len);
>  #endif
>  
>  	default:
> -patch_default: __maybe_unused
>  		ret = paravirt_patch_default(type, ibuf, addr, len);
>  		break;
> -
> -patch_site:
> -		ret = paravirt_patch_insns(ibuf, len, start, end);
> -		break;
>  	}
>  #undef PATCH_SITE
>  	return ret;
> diff --git a/arch/x86/kernel/paravirt_patch_64.c b/arch/x86/kernel/paravirt_patch_64.c
> index 5ad5bcda9dc6..2db6c615932f 100644
> --- a/arch/x86/kernel/paravirt_patch_64.c
> +++ b/arch/x86/kernel/paravirt_patch_64.c
> @@ -42,14 +42,15 @@ extern bool pv_is_native_vcpu_is_preempted(void);
>  
>  unsigned native_patch(u8 type, void *ibuf, unsigned long addr, unsigned len)
>  {
> -	const unsigned char *start, *end;
> +	const unsigned char *start __maybe_unused, *end __maybe_unused;

Drop start and end and ...

>  	unsigned ret;
>  
>  #define PATCH_SITE(ops, x)					\
>  		case PARAVIRT_PATCH(ops.x):			\
>  			start = start_##ops##_##x;		\
>  			end = end_##ops##_##x;			\
> -			goto patch_site
> +								\
> +			return paravirt_patch_insns(ibuf, len, start, end);

... do: return paravirt_patch_insns(ibuf, len, start_##ops##_##x,
end_##ops##_##x)

and please omit the semicolon after the last instruction in the macro.

(below the same)

>  	switch(type) {
>  #ifdef CONFIG_PARAVIRT_XXL
>  		PATCH_SITE(irq, restore_fl);
> @@ -68,27 +69,24 @@ unsigned native_patch(u8 type, void *ibuf, unsigned long addr, unsigned len)
>  			if (pv_is_native_spin_unlock()) {
>  				start = start_lock_queued_spin_unlock;
>  				end   = end_lock_queued_spin_unlock;
> -				goto patch_site;
> +
> +				return paravirt_patch_insns(ibuf, len, start, end);
>  			}
> -			goto patch_default;
> +			return paravirt_patch_default(type, ibuf, addr, len);
>  
>  		case PARAVIRT_PATCH(lock.vcpu_is_preempted):
>  			if (pv_is_native_vcpu_is_preempted()) {
>  				start = start_lock_vcpu_is_preempted;
>  				end   = end_lock_vcpu_is_preempted;
> -				goto patch_site;
> +
> +				return paravirt_patch_insns(ibuf, len, start, end);
>  			}
> -			goto patch_default;
> +			return paravirt_patch_default(type, ibuf, addr, len);
>  #endif
>  
>  	default:
> -patch_default: __maybe_unused
>  		ret = paravirt_patch_default(type, ibuf, addr, len);
>  		break;
> -
> -patch_site:
> -		ret = paravirt_patch_insns(ibuf, len, start, end);
> -		break;
>  	}
>  #undef PATCH_SITE
>  	return ret;
> 

Juergen
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* [PATCH] x86/paravirt: Cleanup native_patch()
From: Borislav Petkov @ 2018-09-08 15:28 UTC (permalink / raw)
  To: Juergen Gross; +Cc: x86, LKML, virtualization
In-Reply-To: <65bb69fa-855f-9efc-19ba-c0fab69f468c@suse.com>

When CONFIG_PARAVIRT_SPINLOCKS=n, it fires

  arch/x86/kernel/paravirt_patch_64.c: In function ‘native_patch’:
  arch/x86/kernel/paravirt_patch_64.c:89:1: warning: label ‘patch_site’ defined but not used [-Wunused-label]
   patch_site:

but those labels can simply be removed by directly calling the
respective functions there.

Get rid of local variables too, while at it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: x86@kernel.org
Cc: virtualization@lists.linux-foundation.org
---

It looks even more cleaner now. :)

 arch/x86/kernel/paravirt_patch_32.c | 48 +++++++++++-----------------
 arch/x86/kernel/paravirt_patch_64.c | 49 ++++++++++++-----------------
 2 files changed, 39 insertions(+), 58 deletions(-)

diff --git a/arch/x86/kernel/paravirt_patch_32.c b/arch/x86/kernel/paravirt_patch_32.c
index d460cbcabcfe..0865323c2716 100644
--- a/arch/x86/kernel/paravirt_patch_32.c
+++ b/arch/x86/kernel/paravirt_patch_32.c
@@ -34,14 +34,10 @@ extern bool pv_is_native_vcpu_is_preempted(void);
 
 unsigned native_patch(u8 type, void *ibuf, unsigned long addr, unsigned len)
 {
-	const unsigned char *start, *end;
-	unsigned ret;
-
 #define PATCH_SITE(ops, x)					\
-		case PARAVIRT_PATCH(ops.x):			\
-			start = start_##ops##_##x;		\
-			end = end_##ops##_##x;			\
-			goto patch_site
+	case PARAVIRT_PATCH(ops.x):				\
+		return paravirt_patch_insns(ibuf, len, start_##ops##_##x, end_##ops##_##x)
+
 	switch (type) {
 #ifdef CONFIG_PARAVIRT_XXL
 		PATCH_SITE(irq, irq_disable);
@@ -54,32 +50,26 @@ unsigned native_patch(u8 type, void *ibuf, unsigned long addr, unsigned len)
 		PATCH_SITE(mmu, write_cr3);
 #endif
 #if defined(CONFIG_PARAVIRT_SPINLOCKS)
-		case PARAVIRT_PATCH(lock.queued_spin_unlock):
-			if (pv_is_native_spin_unlock()) {
-				start = start_lock_queued_spin_unlock;
-				end   = end_lock_queued_spin_unlock;
-				goto patch_site;
-			}
-			goto patch_default;
+	case PARAVIRT_PATCH(lock.queued_spin_unlock):
+		if (pv_is_native_spin_unlock())
+			return paravirt_patch_insns(ibuf, len,
+						    start_lock_queued_spin_unlock,
+						    end_lock_queued_spin_unlock);
+		else
+			return paravirt_patch_default(type, ibuf, addr, len);
 
-		case PARAVIRT_PATCH(lock.vcpu_is_preempted):
-			if (pv_is_native_vcpu_is_preempted()) {
-				start = start_lock_vcpu_is_preempted;
-				end   = end_lock_vcpu_is_preempted;
-				goto patch_site;
-			}
-			goto patch_default;
+	case PARAVIRT_PATCH(lock.vcpu_is_preempted):
+		if (pv_is_native_vcpu_is_preempted())
+			return paravirt_patch_insns(ibuf, len,
+						    start_lock_vcpu_is_preempted,
+						    end_lock_vcpu_is_preempted);
+		else
+			return paravirt_patch_default(type, ibuf, addr, len);
 #endif
 
 	default:
-patch_default: __maybe_unused
-		ret = paravirt_patch_default(type, ibuf, addr, len);
-		break;
-
-patch_site:
-		ret = paravirt_patch_insns(ibuf, len, start, end);
-		break;
+		return paravirt_patch_default(type, ibuf, addr, len);
 	}
 #undef PATCH_SITE
-	return ret;
+	return 0;
 }
diff --git a/arch/x86/kernel/paravirt_patch_64.c b/arch/x86/kernel/paravirt_patch_64.c
index 5ad5bcda9dc6..3549f5f9d685 100644
--- a/arch/x86/kernel/paravirt_patch_64.c
+++ b/arch/x86/kernel/paravirt_patch_64.c
@@ -42,15 +42,11 @@ extern bool pv_is_native_vcpu_is_preempted(void);
 
 unsigned native_patch(u8 type, void *ibuf, unsigned long addr, unsigned len)
 {
-	const unsigned char *start, *end;
-	unsigned ret;
-
 #define PATCH_SITE(ops, x)					\
-		case PARAVIRT_PATCH(ops.x):			\
-			start = start_##ops##_##x;		\
-			end = end_##ops##_##x;			\
-			goto patch_site
-	switch(type) {
+	case PARAVIRT_PATCH(ops.x):				\
+		return paravirt_patch_insns(ibuf, len, start_##ops##_##x, end_##ops##_##x)
+
+	switch (type) {
 #ifdef CONFIG_PARAVIRT_XXL
 		PATCH_SITE(irq, restore_fl);
 		PATCH_SITE(irq, save_fl);
@@ -64,32 +60,27 @@ unsigned native_patch(u8 type, void *ibuf, unsigned long addr, unsigned len)
 		PATCH_SITE(mmu, write_cr3);
 #endif
 #if defined(CONFIG_PARAVIRT_SPINLOCKS)
-		case PARAVIRT_PATCH(lock.queued_spin_unlock):
-			if (pv_is_native_spin_unlock()) {
-				start = start_lock_queued_spin_unlock;
-				end   = end_lock_queued_spin_unlock;
-				goto patch_site;
-			}
-			goto patch_default;
+	case PARAVIRT_PATCH(lock.queued_spin_unlock):
+		if (pv_is_native_spin_unlock())
+			return paravirt_patch_insns(ibuf, len,
+						    start_lock_queued_spin_unlock,
+						    end_lock_queued_spin_unlock);
+		else
+			return paravirt_patch_default(type, ibuf, addr, len);
 
-		case PARAVIRT_PATCH(lock.vcpu_is_preempted):
-			if (pv_is_native_vcpu_is_preempted()) {
-				start = start_lock_vcpu_is_preempted;
-				end   = end_lock_vcpu_is_preempted;
-				goto patch_site;
-			}
-			goto patch_default;
+	case PARAVIRT_PATCH(lock.vcpu_is_preempted):
+		if (pv_is_native_vcpu_is_preempted())
+			return paravirt_patch_insns(ibuf, len,
+						    start_lock_vcpu_is_preempted,
+						    end_lock_vcpu_is_preempted);
+		else
+			return paravirt_patch_default(type, ibuf, addr, len);
 #endif
 
 	default:
-patch_default: __maybe_unused
-		ret = paravirt_patch_default(type, ibuf, addr, len);
-		break;
-
-patch_site:
-		ret = paravirt_patch_insns(ibuf, len, start, end);
+		return paravirt_patch_default(type, ibuf, addr, len);
 		break;
 	}
 #undef PATCH_SITE
-	return ret;
+	return 0;
 }
-- 
2.17.0.582.gccdcbd54c


-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related

* Re: [PATCH net-next v8 5/7] net: vhost: introduce bitmap for vhost_poll
From: Tonghao Zhang @ 2018-09-09 10:58 UTC (permalink / raw)
  To: jasowang; +Cc: Linux Kernel Network Developers, virtualization, mst
In-Reply-To: <24f10821-a07e-e0cd-0266-b33856e7d88a@redhat.com>

On Tue, Aug 21, 2018 at 8:45 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
>
> On 2018年08月19日 20:11, xiangxia.m.yue@gmail.com wrote:
> > From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> >
> > The bitmap of vhost_dev can help us to check if the
> > specified poll is scheduled. This patch will be used
> > for next two patches.
> >
> > Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> > ---
> >   drivers/vhost/net.c   | 11 +++++++++--
> >   drivers/vhost/vhost.c | 17 +++++++++++++++--
> >   drivers/vhost/vhost.h |  7 ++++++-
> >   3 files changed, 30 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 1eff72d..23d7ffc 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -1135,8 +1135,15 @@ static int vhost_net_open(struct inode *inode, struct file *f)
> >       }
> >       vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
> >
> > -     vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev);
> > -     vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
> > +     vhost_poll_init(n->poll + VHOST_NET_VQ_TX,
> > +                     handle_tx_net,
> > +                     VHOST_NET_VQ_TX,
> > +                     EPOLLOUT, dev);
> > +
> > +     vhost_poll_init(n->poll + VHOST_NET_VQ_RX,
> > +                     handle_rx_net,
> > +                     VHOST_NET_VQ_RX,
> > +                     EPOLLIN, dev);
> >
> >       f->private_data = n;
> >
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index a1c06e7..dc88a60 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -186,7 +186,7 @@ void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
> >
> >   /* Init poll structure */
> >   void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> > -                  __poll_t mask, struct vhost_dev *dev)
> > +                  __u8 poll_id, __poll_t mask, struct vhost_dev *dev)
> >   {
> >       init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
> >       init_poll_funcptr(&poll->table, vhost_poll_func);
> > @@ -194,6 +194,7 @@ void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> >       poll->dev = dev;
> >       poll->wqh = NULL;
> >
> > +     poll->poll_id = poll_id;
> >       vhost_work_init(&poll->work, fn);
> >   }
> >   EXPORT_SYMBOL_GPL(vhost_poll_init);
> > @@ -276,8 +277,16 @@ bool vhost_has_work(struct vhost_dev *dev)
> >   }
> >   EXPORT_SYMBOL_GPL(vhost_has_work);
> >
> > +bool vhost_has_work_pending(struct vhost_dev *dev, int poll_id)
> > +{
> > +     return !llist_empty(&dev->work_list) &&
> > +             test_bit(poll_id, dev->work_pending);
>
> I think we've already had something similar. E.g can we test
> VHOST_WORK_QUEUED instead?
done, thanks. I am busy two weeks ago, and now I focus on these patches.
> Thanks
>
> > +}
> > +EXPORT_SYMBOL_GPL(vhost_has_work_pending);
> > +
> >   void vhost_poll_queue(struct vhost_poll *poll)
> >   {
> > +     set_bit(poll->poll_id, poll->dev->work_pending);
> >       vhost_work_queue(poll->dev, &poll->work);
> >   }
> >   EXPORT_SYMBOL_GPL(vhost_poll_queue);
> > @@ -354,6 +363,7 @@ static int vhost_worker(void *data)
> >               if (!node)
> >                       schedule();
> >
> > +             bitmap_zero(dev->work_pending, VHOST_DEV_MAX_VQ);
> >               node = llist_reverse_order(node);
> >               /* make sure flag is seen after deletion */
> >               smp_wmb();
> > @@ -420,6 +430,8 @@ void vhost_dev_init(struct vhost_dev *dev,
> >       struct vhost_virtqueue *vq;
> >       int i;
> >
> > +     BUG_ON(nvqs > VHOST_DEV_MAX_VQ);
> > +
> >       dev->vqs = vqs;
> >       dev->nvqs = nvqs;
> >       mutex_init(&dev->mutex);
> > @@ -428,6 +440,7 @@ void vhost_dev_init(struct vhost_dev *dev,
> >       dev->iotlb = NULL;
> >       dev->mm = NULL;
> >       dev->worker = NULL;
> > +     bitmap_zero(dev->work_pending, VHOST_DEV_MAX_VQ);
> >       init_llist_head(&dev->work_list);
> >       init_waitqueue_head(&dev->wait);
> >       INIT_LIST_HEAD(&dev->read_list);
> > @@ -445,7 +458,7 @@ void vhost_dev_init(struct vhost_dev *dev,
> >               vhost_vq_reset(dev, vq);
> >               if (vq->handle_kick)
> >                       vhost_poll_init(&vq->poll, vq->handle_kick,
> > -                                     EPOLLIN, dev);
> > +                                     i, EPOLLIN, dev);
> >       }
> >   }
> >   EXPORT_SYMBOL_GPL(vhost_dev_init);
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index 6c844b9..60b6f6d 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -30,6 +30,7 @@ struct vhost_poll {
> >       wait_queue_head_t        *wqh;
> >       wait_queue_entry_t              wait;
> >       struct vhost_work         work;
> > +     __u8                      poll_id;
> >       __poll_t                  mask;
> >       struct vhost_dev         *dev;
> >   };
> > @@ -37,9 +38,10 @@ struct vhost_poll {
> >   void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
> >   void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
> >   bool vhost_has_work(struct vhost_dev *dev);
> > +bool vhost_has_work_pending(struct vhost_dev *dev, int poll_id);
> >
> >   void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> > -                  __poll_t mask, struct vhost_dev *dev);
> > +                  __u8 id, __poll_t mask, struct vhost_dev *dev);
> >   int vhost_poll_start(struct vhost_poll *poll, struct file *file);
> >   void vhost_poll_stop(struct vhost_poll *poll);
> >   void vhost_poll_flush(struct vhost_poll *poll);
> > @@ -152,6 +154,8 @@ struct vhost_msg_node {
> >     struct list_head node;
> >   };
> >
> > +#define VHOST_DEV_MAX_VQ     128
> > +
> >   struct vhost_dev {
> >       struct mm_struct *mm;
> >       struct mutex mutex;
> > @@ -159,6 +163,7 @@ struct vhost_dev {
> >       int nvqs;
> >       struct eventfd_ctx *log_ctx;
> >       struct llist_head work_list;
> > +     DECLARE_BITMAP(work_pending, VHOST_DEV_MAX_VQ);
> >       struct task_struct *worker;
> >       struct vhost_umem *umem;
> >       struct vhost_umem *iotlb;
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net-next v8 3/7] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
From: Tonghao Zhang @ 2018-09-09 10:58 UTC (permalink / raw)
  To: jasowang; +Cc: Linux Kernel Network Developers, virtualization, mst
In-Reply-To: <179340e9-f1cd-8378-1a01-61307546003c@redhat.com>

On Tue, Aug 21, 2018 at 11:15 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
>
> On 2018年08月19日 20:11, xiangxia.m.yue@gmail.com wrote:
> > From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> >
> > Factor out generic busy polling logic and will be
> > used for in tx path in the next patch. And with the patch,
> > qemu can set differently the busyloop_timeout for rx queue.
> >
> > To avoid duplicate codes, introduce the helper functions:
> > * sock_has_rx_data(changed from sk_has_rx_data)
> > * vhost_net_busy_poll_try_queue
> >
> > Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> > ---
> >   drivers/vhost/net.c | 111 +++++++++++++++++++++++++++++++++-------------------
> >   1 file changed, 71 insertions(+), 40 deletions(-)
> >
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 32c1b52..453c061 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -440,6 +440,75 @@ static void vhost_net_signal_used(struct vhost_net_virtqueue *nvq)
> >       nvq->done_idx = 0;
> >   }
> >
> > +static int sock_has_rx_data(struct socket *sock)
> > +{
> > +     if (unlikely(!sock))
> > +             return 0;
> > +
> > +     if (sock->ops->peek_len)
> > +             return sock->ops->peek_len(sock);
> > +
> > +     return skb_queue_empty(&sock->sk->sk_receive_queue);
> > +}
> > +
> > +static void vhost_net_busy_poll_try_queue(struct vhost_net *net,
> > +                                       struct vhost_virtqueue *vq)
> > +{
> > +     if (!vhost_vq_avail_empty(&net->dev, vq)) {
> > +             vhost_poll_queue(&vq->poll);
> > +     } else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
> > +             vhost_disable_notify(&net->dev, vq);
> > +             vhost_poll_queue(&vq->poll);
> > +     }
> > +}
> > +
> > +static void vhost_net_busy_poll(struct vhost_net *net,
> > +                             struct vhost_virtqueue *rvq,
> > +                             struct vhost_virtqueue *tvq,
> > +                             bool *busyloop_intr,
> > +                             bool poll_rx)
> > +{
> > +     unsigned long busyloop_timeout;
> > +     unsigned long endtime;
> > +     struct socket *sock;
> > +     struct vhost_virtqueue *vq = poll_rx ? tvq : rvq;
> > +
> > +     mutex_lock_nested(&vq->mutex, poll_rx ? VHOST_NET_VQ_TX: VHOST_NET_VQ_RX);
> > +     vhost_disable_notify(&net->dev, vq);
> > +     sock = rvq->private_data;
> > +
> > +     busyloop_timeout = poll_rx ? rvq->busyloop_timeout:
> > +                                  tvq->busyloop_timeout;
> > +
> > +     preempt_disable();
> > +     endtime = busy_clock() + busyloop_timeout;
> > +
> > +     while (vhost_can_busy_poll(endtime)) {
> > +             if (vhost_has_work(&net->dev)) {
> > +                     *busyloop_intr = true;
> > +                     break;
> > +             }
> > +
> > +             if ((sock_has_rx_data(sock) &&
> > +                  !vhost_vq_avail_empty(&net->dev, rvq)) ||
> > +                 !vhost_vq_avail_empty(&net->dev, tvq))
> > +                     break;
> > +
> > +             cpu_relax();
> > +     }
> > +
> > +     preempt_enable();
> > +
> > +     if (poll_rx)
> > +             vhost_net_busy_poll_try_queue(net, tvq);
> > +     else if (sock_has_rx_data(sock))
> > +             vhost_net_busy_poll_try_queue(net, rvq);
>
> This could be simplified like:
>
> if (poll_rx || sock_has_rx_data(sock))
>      vhost_net_busy_poll_try_queue(net, vq);
done, thanks

> Thanks
>
> > +     else /* On tx here, sock has no rx data. */
> > +             vhost_enable_notify(&net->dev, rvq);
> > +
> > +     mutex_unlock(&vq->mutex);
> > +}
> > +
> >   static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
> >                                   struct vhost_net_virtqueue *nvq,
> >                                   unsigned int *out_num, unsigned int *in_num,
> > @@ -753,16 +822,6 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
> >       return len;
> >   }
> >
> > -static int sk_has_rx_data(struct sock *sk)
> > -{
> > -     struct socket *sock = sk->sk_socket;
> > -
> > -     if (sock->ops->peek_len)
> > -             return sock->ops->peek_len(sock);
> > -
> > -     return skb_queue_empty(&sk->sk_receive_queue);
> > -}
> > -
> >   static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
> >                                     bool *busyloop_intr)
> >   {
> > @@ -770,41 +829,13 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
> >       struct vhost_net_virtqueue *tnvq = &net->vqs[VHOST_NET_VQ_TX];
> >       struct vhost_virtqueue *rvq = &rnvq->vq;
> >       struct vhost_virtqueue *tvq = &tnvq->vq;
> > -     unsigned long uninitialized_var(endtime);
> >       int len = peek_head_len(rnvq, sk);
> >
> > -     if (!len && tvq->busyloop_timeout) {
> > +     if (!len && rvq->busyloop_timeout) {
> >               /* Flush batched heads first */
> >               vhost_net_signal_used(rnvq);
> >               /* Both tx vq and rx socket were polled here */
> > -             mutex_lock_nested(&tvq->mutex, VHOST_NET_VQ_TX);
> > -             vhost_disable_notify(&net->dev, tvq);
> > -
> > -             preempt_disable();
> > -             endtime = busy_clock() + tvq->busyloop_timeout;
> > -
> > -             while (vhost_can_busy_poll(endtime)) {
> > -                     if (vhost_has_work(&net->dev)) {
> > -                             *busyloop_intr = true;
> > -                             break;
> > -                     }
> > -                     if ((sk_has_rx_data(sk) &&
> > -                          !vhost_vq_avail_empty(&net->dev, rvq)) ||
> > -                         !vhost_vq_avail_empty(&net->dev, tvq))
> > -                             break;
> > -                     cpu_relax();
> > -             }
> > -
> > -             preempt_enable();
> > -
> > -             if (!vhost_vq_avail_empty(&net->dev, tvq)) {
> > -                     vhost_poll_queue(&tvq->poll);
> > -             } else if (unlikely(vhost_enable_notify(&net->dev, tvq))) {
> > -                     vhost_disable_notify(&net->dev, tvq);
> > -                     vhost_poll_queue(&tvq->poll);
> > -             }
> > -
> > -             mutex_unlock(&tvq->mutex);
> > +             vhost_net_busy_poll(net, rvq, tvq, busyloop_intr, true);
> >
> >               len = peek_head_len(rnvq, sk);
> >       }
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* [PATCH net-next v9 0/6] net: vhost: improve performance when enable busyloop
From: xiangxia.m.yue @ 2018-09-09 11:51 UTC (permalink / raw)
  To: jasowang, mst, makita.toshiaki; +Cc: netdev, virtualization

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

This patches improve the guest receive performance.
On the handle_tx side, we poll the sock receive queue
at the same time. handle_rx do that in the same way.

For more performance report, see patch 4, 5, 6

Tonghao Zhang (6):
  net: vhost: lock the vqs one by one
  net: vhost: replace magic number of lock annotation
  net: vhost: factor out busy polling logic to vhost_net_busy_poll()
  net: vhost: add rx busy polling in tx path
  net: vhost: disable rx wakeup during tx busypoll
  net: vhost: make busyloop_intr more accurate

 drivers/vhost/net.c   | 163 +++++++++++++++++++++++++++++++-------------------
 drivers/vhost/vhost.c |  24 +++-----
 2 files changed, 108 insertions(+), 79 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* [PATCH net-next v9 1/6] net: vhost: lock the vqs one by one
From: xiangxia.m.yue @ 2018-09-09 11:51 UTC (permalink / raw)
  To: jasowang, mst, makita.toshiaki; +Cc: netdev, virtualization
In-Reply-To: <1536493887-2637-1-git-send-email-xiangxia.m.yue@gmail.com>

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

This patch changes the way that lock all vqs
at the same, to lock them one by one. It will
be used for next patch to avoid the deadlock.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 24 +++++++-----------------
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a502f1a..a1c06e7 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -294,8 +294,11 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
 {
 	int i;
 
-	for (i = 0; i < d->nvqs; ++i)
+	for (i = 0; i < d->nvqs; ++i) {
+		mutex_lock(&d->vqs[i]->mutex);
 		__vhost_vq_meta_reset(d->vqs[i]);
+		mutex_unlock(&d->vqs[i]->mutex);
+	}
 }
 
 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -890,20 +893,6 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
 #define vhost_get_used(vq, x, ptr) \
 	vhost_get_user(vq, x, ptr, VHOST_ADDR_USED)
 
-static void vhost_dev_lock_vqs(struct vhost_dev *d)
-{
-	int i = 0;
-	for (i = 0; i < d->nvqs; ++i)
-		mutex_lock_nested(&d->vqs[i]->mutex, i);
-}
-
-static void vhost_dev_unlock_vqs(struct vhost_dev *d)
-{
-	int i = 0;
-	for (i = 0; i < d->nvqs; ++i)
-		mutex_unlock(&d->vqs[i]->mutex);
-}
-
 static int vhost_new_umem_range(struct vhost_umem *umem,
 				u64 start, u64 size, u64 end,
 				u64 userspace_addr, int perm)
@@ -953,7 +942,10 @@ static void vhost_iotlb_notify_vq(struct vhost_dev *d,
 		if (msg->iova <= vq_msg->iova &&
 		    msg->iova + msg->size - 1 > vq_msg->iova &&
 		    vq_msg->type == VHOST_IOTLB_MISS) {
+			mutex_lock(&node->vq->mutex);
 			vhost_poll_queue(&node->vq->poll);
+			mutex_unlock(&node->vq->mutex);
+
 			list_del(&node->node);
 			kfree(node);
 		}
@@ -985,7 +977,6 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 	int ret = 0;
 
 	mutex_lock(&dev->mutex);
-	vhost_dev_lock_vqs(dev);
 	switch (msg->type) {
 	case VHOST_IOTLB_UPDATE:
 		if (!dev->iotlb) {
@@ -1019,7 +1010,6 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 		break;
 	}
 
-	vhost_dev_unlock_vqs(dev);
 	mutex_unlock(&dev->mutex);
 
 	return ret;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next v9 2/6] net: vhost: replace magic number of lock annotation
From: xiangxia.m.yue @ 2018-09-09 11:51 UTC (permalink / raw)
  To: jasowang, mst, makita.toshiaki; +Cc: netdev, virtualization
In-Reply-To: <1536493887-2637-1-git-send-email-xiangxia.m.yue@gmail.com>

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

Use the VHOST_NET_VQ_XXX as a subclass for mutex_lock_nested.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 367d802..32c1b52 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -712,7 +712,7 @@ static void handle_tx(struct vhost_net *net)
 	struct vhost_virtqueue *vq = &nvq->vq;
 	struct socket *sock;
 
-	mutex_lock(&vq->mutex);
+	mutex_lock_nested(&vq->mutex, VHOST_NET_VQ_TX);
 	sock = vq->private_data;
 	if (!sock)
 		goto out;
@@ -777,7 +777,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
 		/* Flush batched heads first */
 		vhost_net_signal_used(rnvq);
 		/* Both tx vq and rx socket were polled here */
-		mutex_lock_nested(&tvq->mutex, 1);
+		mutex_lock_nested(&tvq->mutex, VHOST_NET_VQ_TX);
 		vhost_disable_notify(&net->dev, tvq);
 
 		preempt_disable();
@@ -919,7 +919,7 @@ static void handle_rx(struct vhost_net *net)
 	__virtio16 num_buffers;
 	int recv_pkts = 0;
 
-	mutex_lock_nested(&vq->mutex, 0);
+	mutex_lock_nested(&vq->mutex, VHOST_NET_VQ_RX);
 	sock = vq->private_data;
 	if (!sock)
 		goto out;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next v9 3/6] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
From: xiangxia.m.yue @ 2018-09-09 11:51 UTC (permalink / raw)
  To: jasowang, mst, makita.toshiaki; +Cc: netdev, virtualization
In-Reply-To: <1536493887-2637-1-git-send-email-xiangxia.m.yue@gmail.com>

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

Factor out generic busy polling logic and will be
used for in tx path in the next patch. And with the patch,
qemu can set differently the busyloop_timeout for rx queue.

To avoid duplicate codes, introduce the helper functions:
* sock_has_rx_data(changed from sk_has_rx_data)
* vhost_net_busy_poll_try_queue

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
---
 drivers/vhost/net.c | 109 +++++++++++++++++++++++++++++++++-------------------
 1 file changed, 69 insertions(+), 40 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 32c1b52..3f68265 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -440,6 +440,73 @@ static void vhost_net_signal_used(struct vhost_net_virtqueue *nvq)
 	nvq->done_idx = 0;
 }
 
+static int sock_has_rx_data(struct socket *sock)
+{
+	if (unlikely(!sock))
+		return 0;
+
+	if (sock->ops->peek_len)
+		return sock->ops->peek_len(sock);
+
+	return skb_queue_empty(&sock->sk->sk_receive_queue);
+}
+
+static void vhost_net_busy_poll_try_queue(struct vhost_net *net,
+					  struct vhost_virtqueue *vq)
+{
+	if (!vhost_vq_avail_empty(&net->dev, vq)) {
+		vhost_poll_queue(&vq->poll);
+	} else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
+		vhost_disable_notify(&net->dev, vq);
+		vhost_poll_queue(&vq->poll);
+	}
+}
+
+static void vhost_net_busy_poll(struct vhost_net *net,
+				struct vhost_virtqueue *rvq,
+				struct vhost_virtqueue *tvq,
+				bool *busyloop_intr,
+				bool poll_rx)
+{
+	unsigned long busyloop_timeout;
+	unsigned long endtime;
+	struct socket *sock;
+	struct vhost_virtqueue *vq = poll_rx ? tvq : rvq;
+
+	mutex_lock_nested(&vq->mutex, poll_rx ? VHOST_NET_VQ_TX: VHOST_NET_VQ_RX);
+	vhost_disable_notify(&net->dev, vq);
+	sock = rvq->private_data;
+
+	busyloop_timeout = poll_rx ? rvq->busyloop_timeout:
+				     tvq->busyloop_timeout;
+
+	preempt_disable();
+	endtime = busy_clock() + busyloop_timeout;
+
+	while (vhost_can_busy_poll(endtime)) {
+		if (vhost_has_work(&net->dev)) {
+			*busyloop_intr = true;
+			break;
+		}
+
+		if ((sock_has_rx_data(sock) &&
+		     !vhost_vq_avail_empty(&net->dev, rvq)) ||
+		    !vhost_vq_avail_empty(&net->dev, tvq))
+			break;
+
+		cpu_relax();
+	}
+
+	preempt_enable();
+
+	if (poll_rx || sock_has_rx_data(sock))
+		vhost_net_busy_poll_try_queue(net, vq);
+	else if (!poll_rx) /* On tx here, sock has no rx data. */
+		vhost_enable_notify(&net->dev, rvq);
+
+	mutex_unlock(&vq->mutex);
+}
+
 static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 				    struct vhost_net_virtqueue *nvq,
 				    unsigned int *out_num, unsigned int *in_num,
@@ -753,16 +820,6 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
 	return len;
 }
 
-static int sk_has_rx_data(struct sock *sk)
-{
-	struct socket *sock = sk->sk_socket;
-
-	if (sock->ops->peek_len)
-		return sock->ops->peek_len(sock);
-
-	return skb_queue_empty(&sk->sk_receive_queue);
-}
-
 static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
 				      bool *busyloop_intr)
 {
@@ -770,41 +827,13 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
 	struct vhost_net_virtqueue *tnvq = &net->vqs[VHOST_NET_VQ_TX];
 	struct vhost_virtqueue *rvq = &rnvq->vq;
 	struct vhost_virtqueue *tvq = &tnvq->vq;
-	unsigned long uninitialized_var(endtime);
 	int len = peek_head_len(rnvq, sk);
 
-	if (!len && tvq->busyloop_timeout) {
+	if (!len && rvq->busyloop_timeout) {
 		/* Flush batched heads first */
 		vhost_net_signal_used(rnvq);
 		/* Both tx vq and rx socket were polled here */
-		mutex_lock_nested(&tvq->mutex, VHOST_NET_VQ_TX);
-		vhost_disable_notify(&net->dev, tvq);
-
-		preempt_disable();
-		endtime = busy_clock() + tvq->busyloop_timeout;
-
-		while (vhost_can_busy_poll(endtime)) {
-			if (vhost_has_work(&net->dev)) {
-				*busyloop_intr = true;
-				break;
-			}
-			if ((sk_has_rx_data(sk) &&
-			     !vhost_vq_avail_empty(&net->dev, rvq)) ||
-			    !vhost_vq_avail_empty(&net->dev, tvq))
-				break;
-			cpu_relax();
-		}
-
-		preempt_enable();
-
-		if (!vhost_vq_avail_empty(&net->dev, tvq)) {
-			vhost_poll_queue(&tvq->poll);
-		} else if (unlikely(vhost_enable_notify(&net->dev, tvq))) {
-			vhost_disable_notify(&net->dev, tvq);
-			vhost_poll_queue(&tvq->poll);
-		}
-
-		mutex_unlock(&tvq->mutex);
+		vhost_net_busy_poll(net, rvq, tvq, busyloop_intr, true);
 
 		len = peek_head_len(rnvq, sk);
 	}
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next v9 4/6] net: vhost: add rx busy polling in tx path
From: xiangxia.m.yue @ 2018-09-09 11:51 UTC (permalink / raw)
  To: jasowang, mst, makita.toshiaki; +Cc: netdev, virtualization
In-Reply-To: <1536493887-2637-1-git-send-email-xiangxia.m.yue@gmail.com>

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

This patch improves the guest receive performance.
On the handle_tx side, we poll the sock receive queue at the
same time. handle_rx do that in the same way.

We set the poll-us=100us and use the netperf to test throughput
and mean latency. When running the tests, the vhost-net kthread
of that VM, is alway 100% CPU. The commands are shown as below.

Rx performance is greatly improved by this patch. There is not
notable performance change on tx with this series though. This
patch is useful for bi-directional traffic.

netperf -H IP -t TCP_STREAM -l 20 -- -O "THROUGHPUT, THROUGHPUT_UNITS, MEAN_LATENCY"

Topology:
[Host] ->linux bridge -> tap vhost-net ->[Guest]

TCP_STREAM:
* Without the patch:  19842.95 Mbps, 6.50 us mean latency
* With the patch:     37598.20 Mbps, 3.43 us mean latency

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
---
 drivers/vhost/net.c | 33 +++++++++++++--------------------
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 3f68265..d372c14 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -508,31 +508,24 @@ static void vhost_net_busy_poll(struct vhost_net *net,
 }
 
 static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
-				    struct vhost_net_virtqueue *nvq,
+				    struct vhost_net_virtqueue *tnvq,
 				    unsigned int *out_num, unsigned int *in_num,
 				    bool *busyloop_intr)
 {
-	struct vhost_virtqueue *vq = &nvq->vq;
-	unsigned long uninitialized_var(endtime);
-	int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+	struct vhost_net_virtqueue *rnvq = &net->vqs[VHOST_NET_VQ_RX];
+	struct vhost_virtqueue *rvq = &rnvq->vq;
+	struct vhost_virtqueue *tvq = &tnvq->vq;
+
+	int r = vhost_get_vq_desc(tvq, tvq->iov, ARRAY_SIZE(tvq->iov),
 				  out_num, in_num, NULL, NULL);
 
-	if (r == vq->num && vq->busyloop_timeout) {
-		if (!vhost_sock_zcopy(vq->private_data))
-			vhost_net_signal_used(nvq);
-		preempt_disable();
-		endtime = busy_clock() + vq->busyloop_timeout;
-		while (vhost_can_busy_poll(endtime)) {
-			if (vhost_has_work(vq->dev)) {
-				*busyloop_intr = true;
-				break;
-			}
-			if (!vhost_vq_avail_empty(vq->dev, vq))
-				break;
-			cpu_relax();
-		}
-		preempt_enable();
-		r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+	if (r == tvq->num && tvq->busyloop_timeout) {
+		if (!vhost_sock_zcopy(tvq->private_data))
+			vhost_net_signal_used(tnvq);
+
+		vhost_net_busy_poll(net, rvq, tvq, busyloop_intr, false);
+
+		r = vhost_get_vq_desc(tvq, tvq->iov, ARRAY_SIZE(tvq->iov),
 				      out_num, in_num, NULL, NULL);
 	}
 
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net-next v9 0/6] net: vhost: improve performance when enable busyloop
From: David Miller @ 2018-09-09 15:01 UTC (permalink / raw)
  To: xiangxia.m.yue; +Cc: netdev, virtualization, mst
In-Reply-To: <1536493887-2637-1-git-send-email-xiangxia.m.yue@gmail.com>

From: xiangxia.m.yue@gmail.com
Date: Sun,  9 Sep 2018 04:51:21 -0700

> From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> 
> This patches improve the guest receive performance.
> On the handle_tx side, we poll the sock receive queue
> at the same time. handle_rx do that in the same way.
> 
> For more performance report, see patch 4, 5, 6

I only see patches 1-4.

^ permalink raw reply

* Re: [PATCH net-next v2 3/5] virtio_ring: add packed ring support
From: Tiwei Bie @ 2018-09-10  2:03 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180907083500-mutt-send-email-mst@kernel.org>

On Fri, Sep 07, 2018 at 09:49:14AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 11, 2018 at 10:27:09AM +0800, Tiwei Bie wrote:
> > This commit introduces the support (without EVENT_IDX) for
> > packed ring.
> > 
> > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > ---
> >  drivers/virtio/virtio_ring.c | 495 ++++++++++++++++++++++++++++++++++-
> >  1 file changed, 487 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index c4f8abc7445a..f317b485ba54 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -55,12 +55,21 @@
> >  #define END_USE(vq)
> >  #endif
> >  
> > +#define _VRING_DESC_F_AVAIL(b)	((__u16)(b) << 7)
> > +#define _VRING_DESC_F_USED(b)	((__u16)(b) << 15)
> 
> Underscore followed by an upper case letter isn't a good idea
> for a symbol. And it's not nice that this does not
> use the VRING_DESC_F_USED from the header.
> How about ((b) ? VRING_DESC_F_USED : 0) instead?
> Is produced code worse then?

Yes, I think the produced code is worse when we use
conditional expression. Below is a simple test:

#define foo1(b) ((b) << 10)
#define foo2(b) ((b) ? (1 << 10) : 0)

unsigned short bar(unsigned short b)
{
	return foo1(b);
}

unsigned short baz(unsigned short b)
{
	return foo2(b);
}

With `gcc -O3 -S`, I got below assembly code:

	.file	"tmp.c"
	.text
	.p2align 4,,15
	.globl	bar
	.type	bar, @function
bar:
.LFB0:
	.cfi_startproc
	movl	%edi, %eax
	sall	$10, %eax
	ret
	.cfi_endproc
.LFE0:
	.size	bar, .-bar
	.p2align 4,,15
	.globl	baz
	.type	baz, @function
baz:
.LFB1:
	.cfi_startproc
	xorl	%eax, %eax
	testw	%di, %di
	setne	%al
	sall	$10, %eax
	ret
	.cfi_endproc
.LFE1:
	.size	baz, .-baz
	.ident	"GCC: (Debian 8.2.0-4) 8.2.0"
	.section	.note.GNU-stack,"",@progbits

That is to say, for `((b) << 10)`, it will shift the
register directly. But for `((b) ? (1 << 10) : 0)`,
in above code, it will zero eax first, and set al
to 1 depend on whether di is 0, and shift eax.

> 
> > +
> >  struct vring_desc_state {
> >  	void *data;			/* Data for callback. */
> >  	struct vring_desc *indir_desc;	/* Indirect descriptor, if any. */
> >  };
> >  
> >  struct vring_desc_state_packed {
> > +	void *data;			/* Data for callback. */
> > +	struct vring_packed_desc *indir_desc; /* Indirect descriptor, if any. */
> 
> Include vring_desc_state for these?

Sure.

> 
> > +	int num;			/* Descriptor list length. */
> > +	dma_addr_t addr;		/* Buffer DMA addr. */
> > +	u32 len;			/* Buffer length. */
> > +	u16 flags;			/* Descriptor flags. */
> 
> This seems only to be used for indirect. Check indirect field above
> instead and drop this.

The `flags` is also needed to know the direction, i.e. DMA_FROM_DEVICE
or DMA_TO_DEVICE when do DMA unmap.

> 
> >  	int next;			/* The next desc state. */
> 
> Packing things into 16 bit integers will help reduce
> cache pressure here.
> 
> Also, this is only used for unmap, so when dma API is not used,
> maintaining addr and len list management is unnecessary. How about we
> maintain addr/len in a separate array, avoiding accesses on unmap in that case?

Sure. I'll give it a try.

> 
> Also, lots of architectures have a nop unmap in the DMA API.
> See a proposed patch at the end for optimizing that case.

Got it. Thanks.

> 
> >  };
> >  
> > @@ -660,7 +669,6 @@ static bool virtqueue_poll_split(struct virtqueue *_vq, unsigned last_used_idx)
> >  {
> >  	struct vring_virtqueue *vq = to_vvq(_vq);
> >  
> > -	virtio_mb(vq->weak_barriers);
> >  	return (u16)last_used_idx != virtio16_to_cpu(_vq->vdev, vq->vring.used->idx);
> >  }
> 
> why is this changing the split queue implementation?

Because above barrier is also needed by virtqueue_poll_packed(),
so I moved it to virtqueue_poll() and also added some comments
for it.

> 
> 
> >  
> > @@ -757,6 +765,72 @@ static inline unsigned vring_size_packed(unsigned int num, unsigned long align)
> >  		& ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
> >  }
> >  
> > +static void vring_unmap_state_packed(const struct vring_virtqueue *vq,
> > +				     struct vring_desc_state_packed *state)
> > +{
> > +	u16 flags;
> > +
> > +	if (!vring_use_dma_api(vq->vq.vdev))
> > +		return;
> > +
> > +	flags = state->flags;
> > +
> > +	if (flags & VRING_DESC_F_INDIRECT) {
> > +		dma_unmap_single(vring_dma_dev(vq),
> > +				 state->addr, state->len,
> > +				 (flags & VRING_DESC_F_WRITE) ?
> > +				 DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > +	} else {
> > +		dma_unmap_page(vring_dma_dev(vq),
> > +			       state->addr, state->len,
> > +			       (flags & VRING_DESC_F_WRITE) ?
> > +			       DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > +	}
> > +}
> > +
> > +static void vring_unmap_desc_packed(const struct vring_virtqueue *vq,
> > +				   struct vring_packed_desc *desc)
> > +{
> > +	u16 flags;
> > +
> > +	if (!vring_use_dma_api(vq->vq.vdev))
> > +		return;
> > +
> > +	flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
> 
> I see no reason to use virtioXX wrappers for the packed ring.
> That's there to support legacy devices. Just use leXX.

Okay.

> 
> > +
> > +	if (flags & VRING_DESC_F_INDIRECT) {
> > +		dma_unmap_single(vring_dma_dev(vq),
> > +				 virtio64_to_cpu(vq->vq.vdev, desc->addr),
> > +				 virtio32_to_cpu(vq->vq.vdev, desc->len),
> > +				 (flags & VRING_DESC_F_WRITE) ?
> > +				 DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > +	} else {
> > +		dma_unmap_page(vring_dma_dev(vq),
> > +			       virtio64_to_cpu(vq->vq.vdev, desc->addr),
> > +			       virtio32_to_cpu(vq->vq.vdev, desc->len),
> > +			       (flags & VRING_DESC_F_WRITE) ?
> > +			       DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > +	}
> > +}
> > +
> > +static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue *_vq,
> > +						       unsigned int total_sg,
> > +						       gfp_t gfp)
> > +{
> > +	struct vring_packed_desc *desc;
> > +
> > +	/*
> > +	 * We require lowmem mappings for the descriptors because
> > +	 * otherwise virt_to_phys will give us bogus addresses in the
> > +	 * virtqueue.
> 
> Where is virt_to_phys used? I don't see it in this patch.

In vring_map_single(), virt_to_phys() will be used to translate
the address to phys if dma api isn't used:

https://github.com/torvalds/linux/blob/a49a9dcce802/drivers/virtio/virtio_ring.c#L197

> 
> > +	 */
> > +	gfp &= ~__GFP_HIGHMEM;
> > +
> > +	desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);
> > +
> > +	return desc;
> > +}
> > +
> >  static inline int virtqueue_add_packed(struct virtqueue *_vq,
> >  				       struct scatterlist *sgs[],
> >  				       unsigned int total_sg,
> > @@ -766,47 +840,449 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
> >  				       void *ctx,
> >  				       gfp_t gfp)
> >  {
> > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > +	struct vring_packed_desc *desc;
> > +	struct scatterlist *sg;
> > +	unsigned int i, n, descs_used, uninitialized_var(prev), err_idx;
> > +	__virtio16 uninitialized_var(head_flags), flags;
> > +	u16 head, avail_wrap_counter, id, curr;
> > +	bool indirect;
> > +
> > +	START_USE(vq);
> > +
> > +	BUG_ON(data == NULL);
> > +	BUG_ON(ctx && vq->indirect);
> > +
> > +	if (unlikely(vq->broken)) {
> > +		END_USE(vq);
> > +		return -EIO;
> > +	}
> > +
> > +#ifdef DEBUG
> > +	{
> > +		ktime_t now = ktime_get();
> > +
> > +		/* No kick or get, with .1 second between?  Warn. */
> > +		if (vq->last_add_time_valid)
> > +			WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time))
> > +					    > 100);
> > +		vq->last_add_time = now;
> > +		vq->last_add_time_valid = true;
> > +	}
> > +#endif
> 
> Add incline helpers for this debug stuff rather than
> duplicate it from split ring?

Sure. I'd like to do that.

> 
> 
> > +
> > +	BUG_ON(total_sg == 0);
> > +
> > +	head = vq->next_avail_idx;
> > +	avail_wrap_counter = vq->avail_wrap_counter;
> > +
> > +	if (virtqueue_use_indirect(_vq, total_sg))
> > +		desc = alloc_indirect_packed(_vq, total_sg, gfp);
> > +	else {
> > +		desc = NULL;
> > +		WARN_ON_ONCE(total_sg > vq->vring_packed.num && !vq->indirect);
> > +	}
> 
> 
> Apparently this attempts to treat indirect descriptors same as
> direct ones. This is how it is in the split ring, but not in
> the packed one. I think you want two separate functions, for
> direct and indirect.

Okay, I'll do that.

> 
> > +
> > +	if (desc) {
> > +		/* Use a single buffer which doesn't continue */
> > +		indirect = true;
> > +		/* Set up rest to use this indirect table. */
> > +		i = 0;
> > +		descs_used = 1;
> > +	} else {
> > +		indirect = false;
> > +		desc = vq->vring_packed.desc;
> > +		i = head;
> > +		descs_used = total_sg;
> > +	}
> > +
> > +	if (vq->vq.num_free < descs_used) {
> > +		pr_debug("Can't add buf len %i - avail = %i\n",
> > +			 descs_used, vq->vq.num_free);
> > +		/* FIXME: for historical reasons, we force a notify here if
> > +		 * there are outgoing parts to the buffer.  Presumably the
> > +		 * host should service the ring ASAP. */
> > +		if (out_sgs)
> > +			vq->notify(&vq->vq);
> > +		if (indirect)
> > +			kfree(desc);
> > +		END_USE(vq);
> > +		return -ENOSPC;
> > +	}
> > +
> > +	id = vq->free_head;
> > +	BUG_ON(id == vq->vring_packed.num);
> > +
> > +	curr = id;
> > +	for (n = 0; n < out_sgs + in_sgs; n++) {
> > +		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> > +			dma_addr_t addr = vring_map_one_sg(vq, sg, n < out_sgs ?
> > +					       DMA_TO_DEVICE : DMA_FROM_DEVICE);
> > +			if (vring_mapping_error(vq, addr))
> > +				goto unmap_release;
> > +
> > +			flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_NEXT |
> > +				  (n < out_sgs ? 0 : VRING_DESC_F_WRITE) |
> > +				  _VRING_DESC_F_AVAIL(vq->avail_wrap_counter) |
> > +				  _VRING_DESC_F_USED(!vq->avail_wrap_counter));
> 
> Spec says:
> 	The VIRTQ_DESC_F_WRITE flags bit is the only valid flag for descriptors in the
> 	indirect table.
> 
> All this logic isn't needed for indirect.
> 
> Also, why re-calculate avail/used flags every time? They only change
> when you wrap around.

Will do that. Thanks.

> 
> 
> > +			if (!indirect && i == head)
> > +				head_flags = flags;
> > +			else
> > +				desc[i].flags = flags;
> > +
> > +			desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
> > +			desc[i].len = cpu_to_virtio32(_vq->vdev, sg->length);
> > +			i++;
> > +			if (!indirect) {
> > +				if (vring_use_dma_api(_vq->vdev)) {
> > +					vq->desc_state_packed[curr].addr = addr;
> > +					vq->desc_state_packed[curr].len =
> > +						sg->length;
> > +					vq->desc_state_packed[curr].flags =
> > +						virtio16_to_cpu(_vq->vdev,
> > +								flags);
> > +				}
> > +				curr = vq->desc_state_packed[curr].next;
> > +
> > +				if (i >= vq->vring_packed.num) {
> > +					i = 0;
> > +					vq->avail_wrap_counter ^= 1;
> > +				}
> > +			}
> > +		}
> > +	}
> > +
> > +	prev = (i > 0 ? i : vq->vring_packed.num) - 1;
> > +	desc[prev].id = cpu_to_virtio16(_vq->vdev, id);
> 
> Is it easier to write this out in all descriptors, to avoid the need to
> calculate prev?

Yeah, I'll do that.

> 
> > +
> > +	/* Last one doesn't continue. */
> > +	if (total_sg == 1)
> > +		head_flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
> > +	else
> > +		desc[prev].flags &= cpu_to_virtio16(_vq->vdev,
> > +						~VRING_DESC_F_NEXT);
> 
> Wouldn't it be easier to avoid setting VRING_DESC_F_NEXT
> in the first place?
> 	 if (n != in_sgs - 1 && n != out_sgs - 1)

Will this affect the branch prediction?

> 
> must be better than writing descriptor an extra time.

Not quite sure about this. I think this descriptor has just
been written, it should still be in the cache.

> 
> > +
> > +	if (indirect) {
> > +		/* Now that the indirect table is filled in, map it. */
> > +		dma_addr_t addr = vring_map_single(
> > +			vq, desc, total_sg * sizeof(struct vring_packed_desc),
> > +			DMA_TO_DEVICE);
> > +		if (vring_mapping_error(vq, addr))
> > +			goto unmap_release;
> > +
> > +		head_flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_INDIRECT |
> > +				      _VRING_DESC_F_AVAIL(avail_wrap_counter) |
> > +				      _VRING_DESC_F_USED(!avail_wrap_counter));
> > +		vq->vring_packed.desc[head].addr = cpu_to_virtio64(_vq->vdev,
> > +								   addr);
> > +		vq->vring_packed.desc[head].len = cpu_to_virtio32(_vq->vdev,
> > +				total_sg * sizeof(struct vring_packed_desc));
> > +		vq->vring_packed.desc[head].id = cpu_to_virtio16(_vq->vdev, id);
> > +
> > +		if (vring_use_dma_api(_vq->vdev)) {
> > +			vq->desc_state_packed[id].addr = addr;
> > +			vq->desc_state_packed[id].len = total_sg *
> > +					sizeof(struct vring_packed_desc);
> > +			vq->desc_state_packed[id].flags =
> > +					virtio16_to_cpu(_vq->vdev, head_flags);
> > +		}
> > +	}
> > +
> > +	/* We're using some buffers from the free list. */
> > +	vq->vq.num_free -= descs_used;
> > +
> > +	/* Update free pointer */
> > +	if (indirect) {
> > +		n = head + 1;
> > +		if (n >= vq->vring_packed.num) {
> > +			n = 0;
> > +			vq->avail_wrap_counter ^= 1;
> > +		}
> > +		vq->next_avail_idx = n;
> > +		vq->free_head = vq->desc_state_packed[id].next;
> > +	} else {
> > +		vq->next_avail_idx = i;
> > +		vq->free_head = curr;
> > +	}
> > +
> > +	/* Store token and indirect buffer state. */
> > +	vq->desc_state_packed[id].num = descs_used;
> > +	vq->desc_state_packed[id].data = data;
> > +	if (indirect)
> > +		vq->desc_state_packed[id].indir_desc = desc;
> > +	else
> > +		vq->desc_state_packed[id].indir_desc = ctx;
> > +
> > +	/* A driver MUST NOT make the first descriptor in the list
> > +	 * available before all subsequent descriptors comprising
> > +	 * the list are made available. */
> > +	virtio_wmb(vq->weak_barriers);
> > +	vq->vring_packed.desc[head].flags = head_flags;
> > +	vq->num_added += descs_used;
> > +
> > +	pr_debug("Added buffer head %i to %p\n", head, vq);
> > +	END_USE(vq);
> > +
> > +	return 0;
> > +
> > +unmap_release:
> > +	err_idx = i;
> > +	i = head;
> > +
> > +	for (n = 0; n < total_sg; n++) {
> > +		if (i == err_idx)
> > +			break;
> > +		vring_unmap_desc_packed(vq, &desc[i]);
> > +		i++;
> > +		if (!indirect && i >= vq->vring_packed.num)
> > +			i = 0;
> > +	}
> > +
> > +	vq->avail_wrap_counter = avail_wrap_counter;
> > +
> > +	if (indirect)
> > +		kfree(desc);
> > +
> > +	END_USE(vq);
> >  	return -EIO;
> >  }
> >  
> >  static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> >  {
> > -	return false;
> > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > +	u16 flags;
> > +	bool needs_kick;
> > +	u32 snapshot;
> > +
> > +	START_USE(vq);
> > +	/* We need to expose the new flags value before checking notification
> > +	 * suppressions. */
> > +	virtio_mb(vq->weak_barriers);
> > +
> > +	snapshot = READ_ONCE(*(u32 *)vq->vring_packed.device);
> > +	flags = virtio16_to_cpu(_vq->vdev, (__virtio16)(snapshot >> 16)) & 0x3;
> 
> What are all these hard-coded things? Also either we do READ_ONCE
> everywhere or nowhere. Why is this place special? Any why read 32 bit
> if you only want the 16?

Yeah, READ_ONCE() and reading 32bits are not really needed in
this patch. But it's needed in the next patch. I thought it's
not wrong to do this, so I introduced them from the first place.

Just to double check: is the below code (apart from the
hard-coded value and virtio16) from the next patch OK?

"""
    snapshot = READ_ONCE(*(u32 *)vq->vring_packed.device);
+   off_wrap = virtio16_to_cpu(_vq->vdev, (__virtio16)(snapshot & 0xffff));
    flags = virtio16_to_cpu(_vq->vdev, (__virtio16)(snapshot >> 16)) & 0x3;
"""

> 
> And doesn't sparse complain about cast to __virtio16?

I'll give it a try in the next version.

> 
> > +
> > +#ifdef DEBUG
> > +	if (vq->last_add_time_valid) {
> > +		WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
> > +					      vq->last_add_time)) > 100);
> > +	}
> > +	vq->last_add_time_valid = false;
> > +#endif
> > +
> > +	needs_kick = (flags != VRING_EVENT_F_DISABLE);
> > +	END_USE(vq);
> > +	return needs_kick;
> > +}
> > +
> > +static void detach_buf_packed(struct vring_virtqueue *vq,
> > +			      unsigned int id, void **ctx)
> > +{
> > +	struct vring_desc_state_packed *state = NULL;
> > +	struct vring_packed_desc *desc;
> > +	unsigned int curr, i;
> > +
> > +	/* Clear data ptr. */
> > +	vq->desc_state_packed[id].data = NULL;
> > +
> > +	curr = id;
> > +	for (i = 0; i < vq->desc_state_packed[id].num; i++) {
> > +		state = &vq->desc_state_packed[curr];
> > +		vring_unmap_state_packed(vq, state);
> > +		curr = state->next;
> > +	}
> > +
> > +	BUG_ON(state == NULL);
> > +	vq->vq.num_free += vq->desc_state_packed[id].num;
> > +	state->next = vq->free_head;
> > +	vq->free_head = id;
> > +
> > +	if (vq->indirect) {
> > +		u32 len;
> > +
> > +		/* Free the indirect table, if any, now that it's unmapped. */
> > +		desc = vq->desc_state_packed[id].indir_desc;
> > +		if (!desc)
> > +			return;
> > +
> > +		if (vring_use_dma_api(vq->vq.vdev)) {
> > +			len = vq->desc_state_packed[id].len;
> > +			for (i = 0; i < len / sizeof(struct vring_packed_desc);
> > +					i++)
> > +				vring_unmap_desc_packed(vq, &desc[i]);
> > +		}
> > +		kfree(desc);
> > +		vq->desc_state_packed[id].indir_desc = NULL;
> > +	} else if (ctx) {
> > +		*ctx = vq->desc_state_packed[id].indir_desc;
> > +	}
> > +}
> > +
> > +static inline bool is_used_desc_packed(const struct vring_virtqueue *vq,
> > +				       u16 idx, bool used_wrap_counter)
> > +{
> > +	u16 flags;
> > +	bool avail, used;
> > +
> > +	flags = virtio16_to_cpu(vq->vq.vdev,
> > +				vq->vring_packed.desc[idx].flags);
> > +	avail = !!(flags & VRING_DESC_F_AVAIL);
> > +	used = !!(flags & VRING_DESC_F_USED);
> > +
> > +	return avail == used && used == used_wrap_counter;
> 
> I think that you don't need to look at avail flag to detect a used
> descriptor. The reason device writes it is to avoid confusing
> *device* next time descriptor wraps.

Okay, I'll just check the used flag.

> 
> >  }
> >  
> >  static inline bool more_used_packed(const struct vring_virtqueue *vq)
> >  {
> > -	return false;
> > +	return is_used_desc_packed(vq, vq->last_used_idx,
> > +			vq->used_wrap_counter);
> >  }
> >  
> >  static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
> >  					  unsigned int *len,
> >  					  void **ctx)
> >  {
> > -	return NULL;
> > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > +	u16 last_used, id;
> > +	void *ret;
> > +
> > +	START_USE(vq);
> > +
> > +	if (unlikely(vq->broken)) {
> > +		END_USE(vq);
> > +		return NULL;
> > +	}
> > +
> > +	if (!more_used_packed(vq)) {
> > +		pr_debug("No more buffers in queue\n");
> > +		END_USE(vq);
> > +		return NULL;
> > +	}
> > +
> > +	/* Only get used elements after they have been exposed by host. */
> > +	virtio_rmb(vq->weak_barriers);
> > +
> > +	last_used = vq->last_used_idx;
> > +	id = virtio16_to_cpu(_vq->vdev, vq->vring_packed.desc[last_used].id);
> > +	*len = virtio32_to_cpu(_vq->vdev, vq->vring_packed.desc[last_used].len);
> > +
> > +	if (unlikely(id >= vq->vring_packed.num)) {
> > +		BAD_RING(vq, "id %u out of range\n", id);
> > +		return NULL;
> > +	}
> > +	if (unlikely(!vq->desc_state_packed[id].data)) {
> > +		BAD_RING(vq, "id %u is not a head!\n", id);
> > +		return NULL;
> > +	}
> > +
> > +	vq->last_used_idx += vq->desc_state_packed[id].num;
> > +	if (vq->last_used_idx >= vq->vring_packed.num) {
> > +		vq->last_used_idx -= vq->vring_packed.num;
> > +		vq->used_wrap_counter ^= 1;
> > +	}
> > +
> > +	/* detach_buf_packed clears data, so grab it now. */
> > +	ret = vq->desc_state_packed[id].data;
> > +	detach_buf_packed(vq, id, ctx);
> > +
> > +#ifdef DEBUG
> > +	vq->last_add_time_valid = false;
> > +#endif
> > +
> > +	END_USE(vq);
> > +	return ret;
> >  }
> >  
> >  static void virtqueue_disable_cb_packed(struct virtqueue *_vq)
> >  {
> > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > +
> > +	if (vq->event_flags_shadow != VRING_EVENT_F_DISABLE) {
> > +		vq->event_flags_shadow = VRING_EVENT_F_DISABLE;
> > +		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > +							vq->event_flags_shadow);
> > +	}
> >  }
> >  
> >  static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
> >  {
> > -	return 0;
> > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > +
> > +	START_USE(vq);
> > +
> > +	/* We optimistically turn back on interrupts, then check if there was
> > +	 * more to do. */
> > +
> > +	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > +		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > +		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > +							vq->event_flags_shadow);
> > +	}
> > +
> > +	END_USE(vq);
> > +	return vq->last_used_idx | ((u16)vq->used_wrap_counter << 15);
> >  }
> >  
> > -static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned last_used_idx)
> > +static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned off_wrap)
> >  {
> > -	return false;
> > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > +	bool wrap_counter;
> > +	u16 used_idx;
> > +
> > +	wrap_counter = off_wrap >> 15;
> > +	used_idx = off_wrap & ~(1 << 15);
> > +
> > +	return is_used_desc_packed(vq, used_idx, wrap_counter);
> 
> These >> 15 << 15 all over the place duplicate info.
> Pls put 15 in the header.

Sure.

> 
> Also can you maintain the counters properly shifted?
> Then just use them.

Then, we may need to maintain both of the shifted wrapper
counters and un-shifted wrapper counters at the same time.

> 
> 
> >  }
> >  
> >  static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
> >  {
> > -	return false;
> > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > +
> > +	START_USE(vq);
> > +
> > +	/* We optimistically turn back on interrupts, then check if there was
> > +	 * more to do. */
> > +
> > +	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > +		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > +		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > +							vq->event_flags_shadow);
> > +		/* We need to enable interrupts first before re-checking
> > +		 * for more used buffers. */
> > +		virtio_mb(vq->weak_barriers);
> > +	}
> > +
> > +	if (more_used_packed(vq)) {
> > +		END_USE(vq);
> > +		return false;
> > +	}
> > +
> > +	END_USE(vq);
> > +	return true;
> >  }
> >  
> >  static void *virtqueue_detach_unused_buf_packed(struct virtqueue *_vq)
> >  {
> > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > +	unsigned int i;
> > +	void *buf;
> > +
> > +	START_USE(vq);
> > +
> > +	for (i = 0; i < vq->vring_packed.num; i++) {
> > +		if (!vq->desc_state_packed[i].data)
> > +			continue;
> > +		/* detach_buf clears data, so grab it now. */
> > +		buf = vq->desc_state_packed[i].data;
> > +		detach_buf_packed(vq, i, NULL);
> > +		END_USE(vq);
> > +		return buf;
> > +	}
> > +	/* That should have freed everything. */
> > +	BUG_ON(vq->vq.num_free != vq->vring_packed.num);
> > +
> > +	END_USE(vq);
> >  	return NULL;
> >  }
> >  
> > @@ -1083,6 +1559,9 @@ bool virtqueue_poll(struct virtqueue *_vq, unsigned last_used_idx)
> >  {
> >  	struct vring_virtqueue *vq = to_vvq(_vq);
> >  
> > +	/* We need to enable interrupts first before re-checking
> > +	 * for more used buffers. */
> > +	virtio_mb(vq->weak_barriers);
> >  	return vq->packed ? virtqueue_poll_packed(_vq, last_used_idx) :
> >  			    virtqueue_poll_split(_vq, last_used_idx);
> >  }
> 
> Possible optimization for when dma API is in use:

Got it. Will give it a try!

Best regards,
Tiwei Bie

> 
> --->
> 
> dma: detecting nop unmap
> 
> drivers need to maintain the dma address for unmap purposes,
> but these cycles are wasted when unmap callback is not
> defined. Add an API for drivers to check that and avoid
> unmap completely. Debug builds still have unmap.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> ----
> 
> diff --git a/include/linux/dma-debug.h b/include/linux/dma-debug.h
> index a785f2507159..38b2c27c8449 100644
> --- a/include/linux/dma-debug.h
> +++ b/include/linux/dma-debug.h
> @@ -42,6 +42,11 @@ extern void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr);
>  extern void debug_dma_unmap_page(struct device *dev, dma_addr_t addr,
>  				 size_t size, int direction, bool map_single);
>  
> +static inline bool has_debug_dma_unmap(struct device *dev)
> +{
> +	return true;
> +}
> +
>  extern void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,
>  			     int nents, int mapped_ents, int direction);
>  
> @@ -121,6 +126,11 @@ static inline void debug_dma_unmap_page(struct device *dev, dma_addr_t addr,
>  {
>  }
>  
> +static inline bool has_debug_dma_unmap(struct device *dev)
> +{
> +	return false;
> +}
> +
>  static inline void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,
>  				    int nents, int mapped_ents, int direction)
>  {
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 1db6a6b46d0d..656f3e518166 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -241,6 +241,14 @@ static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
>  	return addr;
>  }
>  
> +static inline bool has_dma_unmap(struct device *dev)
> +{
> +	const struct dma_map_ops *ops = get_dma_ops(dev);
> +
> +	return ops->unmap_page || ops->unmap_sg || ops->unmap_resource ||
> +		has_dma_unmap(dev);
> +}
> +
>  static inline void dma_unmap_single_attrs(struct device *dev, dma_addr_t addr,
>  					  size_t size,
>  					  enum dma_data_direction dir,

^ permalink raw reply

* Re: [PATCH net-next v2 1/5] virtio: add packed ring definitions
From: Tiwei Bie @ 2018-09-10  2:13 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180907094922-mutt-send-email-mst@kernel.org>

On Fri, Sep 07, 2018 at 09:51:23AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 11, 2018 at 10:27:07AM +0800, Tiwei Bie wrote:
> > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > ---
> >  include/uapi/linux/virtio_config.h |  3 +++
> >  include/uapi/linux/virtio_ring.h   | 43 ++++++++++++++++++++++++++++++
> >  2 files changed, 46 insertions(+)
> > 
> > diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
> > index 449132c76b1c..1196e1c1d4f6 100644
> > --- a/include/uapi/linux/virtio_config.h
> > +++ b/include/uapi/linux/virtio_config.h
> > @@ -75,6 +75,9 @@
> >   */
> >  #define VIRTIO_F_IOMMU_PLATFORM		33
> >  
> > +/* This feature indicates support for the packed virtqueue layout. */
> > +#define VIRTIO_F_RING_PACKED		34
> > +
> >  /*
> >   * Does the device support Single Root I/O Virtualization?
> >   */
> > diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
> > index 6d5d5faa989b..0254a2ba29cf 100644
> > --- a/include/uapi/linux/virtio_ring.h
> > +++ b/include/uapi/linux/virtio_ring.h
> > @@ -44,6 +44,10 @@
> >  /* This means the buffer contains a list of buffer descriptors. */
> >  #define VRING_DESC_F_INDIRECT	4
> >  
> > +/* Mark a descriptor as available or used. */
> > +#define VRING_DESC_F_AVAIL	(1ul << 7)
> > +#define VRING_DESC_F_USED	(1ul << 15)
> > +
> >  /* The Host uses this in used->flags to advise the Guest: don't kick me when
> >   * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
> >   * will still kick if it's out of buffers. */
> > @@ -53,6 +57,17 @@
> >   * optimization.  */
> >  #define VRING_AVAIL_F_NO_INTERRUPT	1
> >  
> > +/* Enable events. */
> > +#define VRING_EVENT_F_ENABLE	0x0
> > +/* Disable events. */
> > +#define VRING_EVENT_F_DISABLE	0x1
> > +/*
> > + * Enable events for a specific descriptor
> > + * (as specified by Descriptor Ring Change Event Offset/Wrap Counter).
> > + * Only valid if VIRTIO_RING_F_EVENT_IDX has been negotiated.
> > + */
> > +#define VRING_EVENT_F_DESC	0x2
> > +
> >  /* We support indirect buffer descriptors */
> >  #define VIRTIO_RING_F_INDIRECT_DESC	28
> >  
> 
> These are for the packed ring, right? Pls prefix accordingly.

How about something like this:

#define VRING_PACKED_DESC_F_AVAIL	(1u << 7)
#define VRING_PACKED_DESC_F_USED	(1u << 15)

#define VRING_PACKED_EVENT_F_ENABLE	0x0
#define VRING_PACKED_EVENT_F_DISABLE	0x1
#define VRING_PACKED_EVENT_F_DESC	0x2


> Also, you likely need macros for the wrap counters.

How about something like this:

#define VRING_PACKED_EVENT_WRAP_COUNTER_SHIFT	15
#define VRING_PACKED_EVENT_WRAP_COUNTER_MASK	\
			(1u << VRING_PACKED_WRAP_COUNTER_SHIFT)
#define VRING_PACKED_EVENT_OFFSET_MASK	\
			((1u << VRING_PACKED_WRAP_COUNTER_SHIFT) - 1)

> 
> > @@ -171,4 +186,32 @@ static inline int vring_need_event(__u16 event_idx, __u16 new_idx, __u16 old)
> >  	return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
> >  }
> >  
> > +struct vring_packed_desc_event {
> > +	/* Descriptor Ring Change Event Offset/Wrap Counter. */
> > +	__virtio16 off_wrap;
> > +	/* Descriptor Ring Change Event Flags. */
> > +	__virtio16 flags;
> > +};
> > +
> > +struct vring_packed_desc {
> > +	/* Buffer Address. */
> > +	__virtio64 addr;
> > +	/* Buffer Length. */
> > +	__virtio32 len;
> > +	/* Buffer ID. */
> > +	__virtio16 id;
> > +	/* The flags depending on descriptor type. */
> > +	__virtio16 flags;
> > +};
> 
> Don't use __virtioXX types, just __leXX ones.

Got it, will do that.

> 
> > +
> > +struct vring_packed {
> > +	unsigned int num;
> > +
> > +	struct vring_packed_desc *desc;
> > +
> > +	struct vring_packed_desc_event *driver;
> > +
> > +	struct vring_packed_desc_event *device;
> > +};
> > +
> >  #endif /* _UAPI_LINUX_VIRTIO_RING_H */
> > -- 
> > 2.18.0

^ permalink raw reply

* Re: [PATCH net-next v2 2/5] virtio_ring: support creating packed ring
From: Tiwei Bie @ 2018-09-10  2:28 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180907095208-mutt-send-email-mst@kernel.org>

On Fri, Sep 07, 2018 at 10:03:24AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 11, 2018 at 10:27:08AM +0800, Tiwei Bie wrote:
> > This commit introduces the support for creating packed ring.
> > All split ring specific functions are added _split suffix.
> > Some necessary stubs for packed ring are also added.
> > 
> > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> 
> I'd rather have a patch just renaming split functions, then
> add all packed stuff in as a separate patch on top.

Sure, I will do that.

> 
> 
> > ---
> >  drivers/virtio/virtio_ring.c | 801 +++++++++++++++++++++++------------
> >  include/linux/virtio_ring.h  |   8 +-
> >  2 files changed, 546 insertions(+), 263 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 814b395007b2..c4f8abc7445a 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -60,11 +60,15 @@ struct vring_desc_state {
> >  	struct vring_desc *indir_desc;	/* Indirect descriptor, if any. */
> >  };
> >  
> > +struct vring_desc_state_packed {
> > +	int next;			/* The next desc state. */
> 
> So this can go away with IN_ORDER?

Yes. If IN_ORDER is negotiated, next won't be needed anymore.
Currently, IN_ORDER isn't included in this patch set, because
some changes for split ring are needed to make sure that it
will use the descs in the expected order. After that,
optimizations can be done for both of split ring and packed
ring respectively.

> 
> > +};
> > +
> >  struct vring_virtqueue {
> >  	struct virtqueue vq;
> >  
> > -	/* Actual memory layout for this queue */
> > -	struct vring vring;
> > +	/* Is this a packed ring? */
> > +	bool packed;
> >  
> >  	/* Can we use weak barriers? */
> >  	bool weak_barriers;
> > @@ -86,11 +90,39 @@ struct vring_virtqueue {
> >  	/* Last used index we've seen. */
> >  	u16 last_used_idx;
> >  
> > -	/* Last written value to avail->flags */
> > -	u16 avail_flags_shadow;
> > +	union {
> > +		/* Available for split ring */
> > +		struct {
> > +			/* Actual memory layout for this queue. */
> > +			struct vring vring;
> >  
> > -	/* Last written value to avail->idx in guest byte order */
> > -	u16 avail_idx_shadow;
> > +			/* Last written value to avail->flags */
> > +			u16 avail_flags_shadow;
> > +
> > +			/* Last written value to avail->idx in
> > +			 * guest byte order. */
> > +			u16 avail_idx_shadow;
> > +		};
> 
> Name this field split so it's easier to detect misuse of e.g.
> packed fields in split code?

Good point, I'll do that.

> 
> > +
> > +		/* Available for packed ring */
> > +		struct {
> > +			/* Actual memory layout for this queue. */
> > +			struct vring_packed vring_packed;
> > +
> > +			/* Driver ring wrap counter. */
> > +			bool avail_wrap_counter;
> > +
> > +			/* Device ring wrap counter. */
> > +			bool used_wrap_counter;
> > +
> > +			/* Index of the next avail descriptor. */
> > +			u16 next_avail_idx;
> > +
> > +			/* Last written value to driver->flags in
> > +			 * guest byte order. */
> > +			u16 event_flags_shadow;
> > +		};
> > +	};
[...]
> > +
> > +/*
> > + * The layout for the packed ring is a continuous chunk of memory
> > + * which looks like this.
> > + *
> > + * struct vring_packed {
> > + *	// The actual descriptors (16 bytes each)
> > + *	struct vring_packed_desc desc[num];
> > + *
> > + *	// Padding to the next align boundary.
> > + *	char pad[];
> > + *
> > + *	// Driver Event Suppression
> > + *	struct vring_packed_desc_event driver;
> > + *
> > + *	// Device Event Suppression
> > + *	struct vring_packed_desc_event device;
> > + * };
> > + */
> 
> Why not just allocate event structures separately?
> Is it a win to have them share a cache line for some reason?

Will do that.

> 
> > +static inline void vring_init_packed(struct vring_packed *vr, unsigned int num,
> > +				     void *p, unsigned long align)
> > +{
> > +	vr->num = num;
> > +	vr->desc = p;
> > +	vr->driver = (void *)ALIGN(((uintptr_t)p +
> > +		sizeof(struct vring_packed_desc) * num), align);
> > +	vr->device = vr->driver + 1;
> > +}
> 
> What's all this about alignment? Where does it come from?

It comes from the `vring_align` parameter of vring_create_virtqueue()
and vring_new_virtqueue():

https://github.com/torvalds/linux/blob/a49a9dcce802/drivers/virtio/virtio_ring.c#L1061
https://github.com/torvalds/linux/blob/a49a9dcce802/drivers/virtio/virtio_ring.c#L1123

Should I just ignore it in packed ring?

CCW defined this:

#define KVM_VIRTIO_CCW_RING_ALIGN 4096

I'm not familiar with CCW. Currently, in this patch set, packed ring
isn't enabled on CCW dues to some legacy accessors are not implemented
in packed ring yet.

> 
> > +
> > +static inline unsigned vring_size_packed(unsigned int num, unsigned long align)
> > +{
> > +	return ((sizeof(struct vring_packed_desc) * num + align - 1)
> > +		& ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
> > +}
[...]
> > @@ -1129,10 +1388,17 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
> >  				      void (*callback)(struct virtqueue *vq),
> >  				      const char *name)
> >  {
> > -	struct vring vring;
> > -	vring_init(&vring, num, pages, vring_align);
> > -	return __vring_new_virtqueue(index, vring, vdev, weak_barriers, context,
> > -				     notify, callback, name);
> > +	union vring_union vring;
> > +	bool packed;
> > +
> > +	packed = virtio_has_feature(vdev, VIRTIO_F_RING_PACKED);
> > +	if (packed)
> > +		vring_init_packed(&vring.vring_packed, num, pages, vring_align);
> > +	else
> > +		vring_init(&vring.vring_split, num, pages, vring_align);
> 
> 
> vring_init in the UAPI header is more or less a bug.
> I'd just stop using it, keep it around for legacy userspace.

Got it. I'd like to do that. Thanks.

> 
> > +
> > +	return __vring_new_virtqueue(index, vring, packed, vdev, weak_barriers,
> > +				     context, notify, callback, name);
> >  }
> >  EXPORT_SYMBOL_GPL(vring_new_virtqueue);
> >  
> > @@ -1142,7 +1408,9 @@ void vring_del_virtqueue(struct virtqueue *_vq)
> >  
> >  	if (vq->we_own_ring) {
> >  		vring_free_queue(vq->vq.vdev, vq->queue_size_in_bytes,
> > -				 vq->vring.desc, vq->queue_dma_addr);
> > +				 vq->packed ? (void *)vq->vring_packed.desc :
> > +					      (void *)vq->vring.desc,
> > +				 vq->queue_dma_addr);
> >  	}
> >  	list_del(&_vq->list);
> >  	kfree(vq);
> > @@ -1184,7 +1452,7 @@ unsigned int virtqueue_get_vring_size(struct virtqueue *_vq)
> >  
> >  	struct vring_virtqueue *vq = to_vvq(_vq);
> >  
> > -	return vq->vring.num;
> > +	return vq->packed ? vq->vring_packed.num : vq->vring.num;
> >  }
> >  EXPORT_SYMBOL_GPL(virtqueue_get_vring_size);
> >  
> > @@ -1227,6 +1495,10 @@ dma_addr_t virtqueue_get_avail_addr(struct virtqueue *_vq)
> >  
> >  	BUG_ON(!vq->we_own_ring);
> >  
> > +	if (vq->packed)
> > +		return vq->queue_dma_addr + ((char *)vq->vring_packed.driver -
> > +				(char *)vq->vring_packed.desc);
> > +
> >  	return vq->queue_dma_addr +
> >  		((char *)vq->vring.avail - (char *)vq->vring.desc);
> >  }
> > @@ -1238,11 +1510,16 @@ dma_addr_t virtqueue_get_used_addr(struct virtqueue *_vq)
> >  
> >  	BUG_ON(!vq->we_own_ring);
> >  
> > +	if (vq->packed)
> > +		return vq->queue_dma_addr + ((char *)vq->vring_packed.device -
> > +				(char *)vq->vring_packed.desc);
> > +
> >  	return vq->queue_dma_addr +
> >  		((char *)vq->vring.used - (char *)vq->vring.desc);
> >  }
> >  EXPORT_SYMBOL_GPL(virtqueue_get_used_addr);
> >  
> > +/* Only available for split ring */
> >  const struct vring *virtqueue_get_vring(struct virtqueue *vq)
> >  {
> >  	return &to_vvq(vq)->vring;
> > diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
> > index fab02133a919..992142b35f55 100644
> > --- a/include/linux/virtio_ring.h
> > +++ b/include/linux/virtio_ring.h
> > @@ -60,6 +60,11 @@ static inline void virtio_store_mb(bool weak_barriers,
> >  struct virtio_device;
> >  struct virtqueue;
> >  
> > +union vring_union {
> > +	struct vring vring_split;
> > +	struct vring_packed vring_packed;
> > +};
> > +
> >  /*
> >   * Creates a virtqueue and allocates the descriptor ring.  If
> >   * may_reduce_num is set, then this may allocate a smaller ring than
> > @@ -79,7 +84,8 @@ struct virtqueue *vring_create_virtqueue(unsigned int index,
> >  
> >  /* Creates a virtqueue with a custom layout. */
> >  struct virtqueue *__vring_new_virtqueue(unsigned int index,
> > -					struct vring vring,
> > +					union vring_union vring,
> > +					bool packed,
> >  					struct virtio_device *vdev,
> >  					bool weak_barriers,
> >  					bool ctx,
> > -- 
> > 2.18.0

^ permalink raw reply

* Re: [virtio-dev] Re: [PATCH net-next v2 4/5] virtio_ring: add event idx support in packed ring
From: Tiwei Bie @ 2018-09-10  2:35 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180907100341-mutt-send-email-mst@kernel.org>

On Fri, Sep 07, 2018 at 10:10:14AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 11, 2018 at 10:27:10AM +0800, Tiwei Bie wrote:
> > This commit introduces the EVENT_IDX support in packed ring.
> > 
> > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> 
> Besides the usual comment about hard-coded constants like <<15:
> does this actually do any good for performance? We don't
> have to if we do not want to.

Got it. Thanks.

> 
> > ---
> >  drivers/virtio/virtio_ring.c | 73 ++++++++++++++++++++++++++++++++----
> >  1 file changed, 65 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index f317b485ba54..f79a1e17f7d1 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -1050,7 +1050,7 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
> >  static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> >  {
> >  	struct vring_virtqueue *vq = to_vvq(_vq);
> > -	u16 flags;
> > +	u16 new, old, off_wrap, flags, wrap_counter, event_idx;
> >  	bool needs_kick;
> >  	u32 snapshot;
> >  
> > @@ -1059,9 +1059,19 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> >  	 * suppressions. */
> >  	virtio_mb(vq->weak_barriers);
> >  
> > +	old = vq->next_avail_idx - vq->num_added;
> > +	new = vq->next_avail_idx;
> > +	vq->num_added = 0;
> > +
> >  	snapshot = READ_ONCE(*(u32 *)vq->vring_packed.device);
> > +	off_wrap = virtio16_to_cpu(_vq->vdev, (__virtio16)(snapshot & 0xffff));
> >  	flags = virtio16_to_cpu(_vq->vdev, (__virtio16)(snapshot >> 16)) & 0x3;
> 
> 
> some kind of struct union would be helpful to make this readable.

I will define a struct/union for this.

> 
> >  
> > +	wrap_counter = off_wrap >> 15;
> > +	event_idx = off_wrap & ~(1 << 15);
> > +	if (wrap_counter != vq->avail_wrap_counter)
> > +		event_idx -= vq->vring_packed.num;
> > +
> >  #ifdef DEBUG
> >  	if (vq->last_add_time_valid) {
> >  		WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
> > @@ -1070,7 +1080,10 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> >  	vq->last_add_time_valid = false;
> >  #endif
> >  
> > -	needs_kick = (flags != VRING_EVENT_F_DISABLE);
> > +	if (flags == VRING_EVENT_F_DESC)
> > +		needs_kick = vring_need_event(event_idx, new, old);
> > +	else
> > +		needs_kick = (flags != VRING_EVENT_F_DISABLE);
> >  	END_USE(vq);
> >  	return needs_kick;
> >  }
> > @@ -1185,6 +1198,15 @@ static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
> >  	ret = vq->desc_state_packed[id].data;
> >  	detach_buf_packed(vq, id, ctx);
> >  
> > +	/* If we expect an interrupt for the next entry, tell host
> > +	 * by writing event index and flush out the write before
> > +	 * the read in the next get_buf call. */
> > +	if (vq->event_flags_shadow == VRING_EVENT_F_DESC)
> > +		virtio_store_mb(vq->weak_barriers,
> > +				&vq->vring_packed.driver->off_wrap,
> > +				cpu_to_virtio16(_vq->vdev, vq->last_used_idx |
> > +					((u16)vq->used_wrap_counter << 15)));
> > +
> >  #ifdef DEBUG
> >  	vq->last_add_time_valid = false;
> >  #endif
> > @@ -1213,8 +1235,18 @@ static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
> >  	/* We optimistically turn back on interrupts, then check if there was
> >  	 * more to do. */
> >  
> > +	if (vq->event) {
> > +		vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> > +				vq->last_used_idx |
> > +				((u16)vq->used_wrap_counter << 15));
> > +		/* We need to update event offset and event wrap
> > +		 * counter first before updating event flags. */
> > +		virtio_wmb(vq->weak_barriers);
> > +	}
> > +
> >  	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > -		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > +		vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> > +						     VRING_EVENT_F_ENABLE;
> >  		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> >  							vq->event_flags_shadow);
> >  	}
> > @@ -1238,22 +1270,47 @@ static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned off_wrap)
> >  static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
> >  {
> >  	struct vring_virtqueue *vq = to_vvq(_vq);
> > +	u16 bufs, used_idx, wrap_counter;
> >  
> >  	START_USE(vq);
> >  
> >  	/* We optimistically turn back on interrupts, then check if there was
> >  	 * more to do. */
> >  
> > +	if (vq->event) {
> > +		/* TODO: tune this threshold */
> > +		bufs = (vq->vring_packed.num - _vq->num_free) * 3 / 4;
> > +		wrap_counter = vq->used_wrap_counter;
> > +
> > +		used_idx = vq->last_used_idx + bufs;
> > +		if (used_idx >= vq->vring_packed.num) {
> > +			used_idx -= vq->vring_packed.num;
> > +			wrap_counter ^= 1;
> > +		}
> > +
> > +		vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> > +				used_idx | (wrap_counter << 15));
> > +
> > +		/* We need to update event offset and event wrap
> > +		 * counter first before updating event flags. */
> > +		virtio_wmb(vq->weak_barriers);
> > +	} else {
> > +		used_idx = vq->last_used_idx;
> > +		wrap_counter = vq->used_wrap_counter;
> > +	}
> > +
> >  	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > -		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > +		vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> > +						     VRING_EVENT_F_ENABLE;
> >  		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> >  							vq->event_flags_shadow);
> > -		/* We need to enable interrupts first before re-checking
> > -		 * for more used buffers. */
> > -		virtio_mb(vq->weak_barriers);
> >  	}
> >  
> > -	if (more_used_packed(vq)) {
> > +	/* We need to update event suppression structure first
> > +	 * before re-checking for more used buffers. */
> > +	virtio_mb(vq->weak_barriers);
> > +
> 
> mb is expensive. We should not do it if we changed nothing.

I will try to avoid it when possible.

> 
> > +	if (is_used_desc_packed(vq, used_idx, wrap_counter)) {
> >  		END_USE(vq);
> >  		return false;
> >  	}
> > -- 
> > 2.18.0
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> 

^ permalink raw reply

* Re: [virtio-dev] Re: [PATCH net-next v2 0/5] virtio: support packed ring
From: Tiwei Bie @ 2018-09-10  3:00 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180907084509-mutt-send-email-mst@kernel.org>

On Fri, Sep 07, 2018 at 09:00:49AM -0400, Michael S. Tsirkin wrote:
> On Fri, Sep 07, 2018 at 09:22:25AM +0800, Tiwei Bie wrote:
> > On Mon, Aug 27, 2018 at 05:00:40PM +0300, Michael S. Tsirkin wrote:
> > > Are there still plans to test the performance with vost pmd?
> > > vhost doesn't seem to show a performance gain ...
> > > 
> > 
> > I tried some performance tests with vhost PMD. In guest, the
> > XDP program will return XDP_DROP directly. And in host, testpmd
> > will do txonly fwd.
> > 
> > When burst size is 1 and packet size is 64 in testpmd and
> > testpmd needs to iterate 5 Tx queues (but only the first two
> > queues are enabled) to prepare and inject packets, I got ~12%
> > performance boost (5.7Mpps -> 6.4Mpps). And if the vhost PMD
> > is faster (e.g. just need to iterate the first two queues to
> > prepare and inject packets), then I got similar performance
> > for both rings (~9.9Mpps) (packed ring's performance can be
> > lower, because it's more complicated in driver.)
> > 
> > I think packed ring makes vhost PMD faster, but it doesn't make
> > the driver faster. In packed ring, the ring is simplified, and
> > the handling of the ring in vhost (device) is also simplified,
> > but things are not simplified in driver, e.g. although there is
> > no desc table in the virtqueue anymore, driver still needs to
> > maintain a private desc state table (which is still managed as
> > a list in this patch set) to support the out-of-order desc
> > processing in vhost (device).
> > 
> > I think this patch set is mainly to make the driver have a full
> > functional support for the packed ring, which makes it possible
> > to leverage the packed ring feature in vhost (device). But I'm
> > not sure whether there is any other better idea, I'd like to
> > hear your thoughts. Thanks!
> 
> Just this: Jens seems to report a nice gain with virtio and
> vhost pmd across the board. Try to compare virtio and
> virtio pmd to see what does pmd do better?

The virtio PMD (drivers/net/virtio) in DPDK doesn't need to share
the virtio ring operation code with other drivers and is highly
optimized for network. E.g. in Rx, the Rx burst function won't
chain descs. So the ID management for the Rx ring can be quite
simple and straightforward, we just need to initialize these IDs
when initializing the ring and don't need to change these IDs
in data path anymore (the mergable Rx code in that patch set
assumes the descs will be written back in order, which should be
fixed. I.e., the ID in the desc should be used to index vq->descx[]).
The Tx code in that patch set also assumes the descs will be
written back by device in order, which should be fixed.

But in kernel virtio driver, the virtio_ring.c is very generic.
The enqueue (virtqueue_add()) and dequeue (virtqueue_get_buf_ctx())
functions need to support all the virtio devices and should be
able to handle all the possible cases that may happen. So although
the packed ring can be very efficient in some cases, currently
the room to optimize the performance in kernel's virtio_ring.c
isn't that much. If we want to take the fully advantage of the
packed ring's efficiency, we need some further e.g. API changes
in virtio_ring.c, which shouldn't be part of this patch set. So
I still think this patch set is mainly to make the kernel virtio
driver to have a full functional support of the packed ring, and
we can't expect impressive performance boost with it.

> 
> 
> > 
> > > 
> > > On Wed, Jul 11, 2018 at 10:27:06AM +0800, Tiwei Bie wrote:
> > > > Hello everyone,
> > > > 
> > > > This patch set implements packed ring support in virtio driver.
> > > > 
> > > > Some functional tests have been done with Jason's
> > > > packed ring implementation in vhost:
> > > > 
> > > > https://lkml.org/lkml/2018/7/3/33
> > > > 
> > > > Both of ping and netperf worked as expected.
> > > > 
> > > > v1 -> v2:
> > > > - Use READ_ONCE() to read event off_wrap and flags together (Jason);
> > > > - Add comments related to ccw (Jason);
> > > > 
> > > > RFC (v6) -> v1:
> > > > - Avoid extra virtio_wmb() in virtqueue_enable_cb_delayed_packed()
> > > >   when event idx is off (Jason);
> > > > - Fix bufs calculation in virtqueue_enable_cb_delayed_packed() (Jason);
> > > > - Test the state of the desc at used_idx instead of last_used_idx
> > > >   in virtqueue_enable_cb_delayed_packed() (Jason);
> > > > - Save wrap counter (as part of queue state) in the return value
> > > >   of virtqueue_enable_cb_prepare_packed();
> > > > - Refine the packed ring definitions in uapi;
> > > > - Rebase on the net-next tree;
> > > > 
> > > > RFC v5 -> RFC v6:
> > > > - Avoid tracking addr/len/flags when DMA API isn't used (MST/Jason);
> > > > - Define wrap counter as bool (Jason);
> > > > - Use ALIGN() in vring_init_packed() (Jason);
> > > > - Avoid using pointer to track `next` in detach_buf_packed() (Jason);
> > > > - Add comments for barriers (Jason);
> > > > - Don't enable RING_PACKED on ccw for now (noticed by Jason);
> > > > - Refine the memory barrier in virtqueue_poll();
> > > > - Add a missing memory barrier in virtqueue_enable_cb_delayed_packed();
> > > > - Remove the hacks in virtqueue_enable_cb_prepare_packed();
> > > > 
> > > > RFC v4 -> RFC v5:
> > > > - Save DMA addr, etc in desc state (Jason);
> > > > - Track used wrap counter;
> > > > 
> > > > RFC v3 -> RFC v4:
> > > > - Make ID allocation support out-of-order (Jason);
> > > > - Various fixes for EVENT_IDX support;
> > > > 
> > > > RFC v2 -> RFC v3:
> > > > - Split into small patches (Jason);
> > > > - Add helper virtqueue_use_indirect() (Jason);
> > > > - Just set id for the last descriptor of a list (Jason);
> > > > - Calculate the prev in virtqueue_add_packed() (Jason);
> > > > - Fix/improve desc suppression code (Jason/MST);
> > > > - Refine the code layout for XXX_split/packed and wrappers (MST);
> > > > - Fix the comments and API in uapi (MST);
> > > > - Remove the BUG_ON() for indirect (Jason);
> > > > - Some other refinements and bug fixes;
> > > > 
> > > > RFC v1 -> RFC v2:
> > > > - Add indirect descriptor support - compile test only;
> > > > - Add event suppression supprt - compile test only;
> > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > - Avoid using '%' operator (Jason);
> > > > - Rename free_head -> next_avail_idx (Jason);
> > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > - Some other refinements and bug fixes;
> > > > 
> > > > Thanks!
> > > > 
> > > > Tiwei Bie (5):
> > > >   virtio: add packed ring definitions
> > > >   virtio_ring: support creating packed ring
> > > >   virtio_ring: add packed ring support
> > > >   virtio_ring: add event idx support in packed ring
> > > >   virtio_ring: enable packed ring
> > > > 
> > > >  drivers/s390/virtio/virtio_ccw.c   |   14 +
> > > >  drivers/virtio/virtio_ring.c       | 1365 ++++++++++++++++++++++------
> > > >  include/linux/virtio_ring.h        |    8 +-
> > > >  include/uapi/linux/virtio_config.h |    3 +
> > > >  include/uapi/linux/virtio_ring.h   |   43 +
> > > >  5 files changed, 1157 insertions(+), 276 deletions(-)
> > > > 
> > > > -- 
> > > > 2.18.0
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > > 

^ permalink raw reply

* Re: [virtio-dev] Re: [PATCH net-next v2 0/5] virtio: support packed ring
From: Jason Wang @ 2018-09-10  3:33 UTC (permalink / raw)
  To: Tiwei Bie, Michael S. Tsirkin
  Cc: virtio-dev, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180910030053.GA15645@debian>



On 2018年09月10日 11:00, Tiwei Bie wrote:
> On Fri, Sep 07, 2018 at 09:00:49AM -0400, Michael S. Tsirkin wrote:
>> On Fri, Sep 07, 2018 at 09:22:25AM +0800, Tiwei Bie wrote:
>>> On Mon, Aug 27, 2018 at 05:00:40PM +0300, Michael S. Tsirkin wrote:
>>>> Are there still plans to test the performance with vost pmd?
>>>> vhost doesn't seem to show a performance gain ...
>>>>
>>> I tried some performance tests with vhost PMD. In guest, the
>>> XDP program will return XDP_DROP directly. And in host, testpmd
>>> will do txonly fwd.
>>>
>>> When burst size is 1 and packet size is 64 in testpmd and
>>> testpmd needs to iterate 5 Tx queues (but only the first two
>>> queues are enabled) to prepare and inject packets, I got ~12%
>>> performance boost (5.7Mpps -> 6.4Mpps). And if the vhost PMD
>>> is faster (e.g. just need to iterate the first two queues to
>>> prepare and inject packets), then I got similar performance
>>> for both rings (~9.9Mpps) (packed ring's performance can be
>>> lower, because it's more complicated in driver.)
>>>
>>> I think packed ring makes vhost PMD faster, but it doesn't make
>>> the driver faster. In packed ring, the ring is simplified, and
>>> the handling of the ring in vhost (device) is also simplified,
>>> but things are not simplified in driver, e.g. although there is
>>> no desc table in the virtqueue anymore, driver still needs to
>>> maintain a private desc state table (which is still managed as
>>> a list in this patch set) to support the out-of-order desc
>>> processing in vhost (device).
>>>
>>> I think this patch set is mainly to make the driver have a full
>>> functional support for the packed ring, which makes it possible
>>> to leverage the packed ring feature in vhost (device). But I'm
>>> not sure whether there is any other better idea, I'd like to
>>> hear your thoughts. Thanks!
>> Just this: Jens seems to report a nice gain with virtio and
>> vhost pmd across the board. Try to compare virtio and
>> virtio pmd to see what does pmd do better?
> The virtio PMD (drivers/net/virtio) in DPDK doesn't need to share
> the virtio ring operation code with other drivers and is highly
> optimized for network. E.g. in Rx, the Rx burst function won't
> chain descs. So the ID management for the Rx ring can be quite
> simple and straightforward, we just need to initialize these IDs
> when initializing the ring and don't need to change these IDs
> in data path anymore (the mergable Rx code in that patch set
> assumes the descs will be written back in order, which should be
> fixed. I.e., the ID in the desc should be used to index vq->descx[]).
> The Tx code in that patch set also assumes the descs will be
> written back by device in order, which should be fixed.

Yes it is. I think I've pointed it out in some early version of pmd 
patch. So I suspect part (or all) of the boost may come from in order 
feature.

>
> But in kernel virtio driver, the virtio_ring.c is very generic.
> The enqueue (virtqueue_add()) and dequeue (virtqueue_get_buf_ctx())
> functions need to support all the virtio devices and should be
> able to handle all the possible cases that may happen. So although
> the packed ring can be very efficient in some cases, currently
> the room to optimize the performance in kernel's virtio_ring.c
> isn't that much. If we want to take the fully advantage of the
> packed ring's efficiency, we need some further e.g. API changes
> in virtio_ring.c, which shouldn't be part of this patch set.

Could you please share more thoughts on this e.g how to improve the API? 
Notice since the API is shared by both split ring and packed ring, it 
may improve the performance of split ring as well. One can easily 
imagine a batching API, but it does not have many real users now, the 
only case is the XDP transmission which can accept an array of XDP frames.

> So
> I still think this patch set is mainly to make the kernel virtio
> driver to have a full functional support of the packed ring, and
> we can't expect impressive performance boost with it.

We can only gain when virtio ring layout is the bottleneck. If there're 
bottlenecks elsewhere, we probably won't see any increasing in the 
numbers. Vhost-net is an example, and lots of optimizations have proved 
that virtio ring is not the main bottleneck for the current codes. I 
suspect it also the case of virtio driver. Did perf tell us any 
interesting things in virtio driver?

Thanks

>
>>
>>>> On Wed, Jul 11, 2018 at 10:27:06AM +0800, Tiwei Bie wrote:
>>>>> Hello everyone,
>>>>>
>>>>> This patch set implements packed ring support in virtio driver.
>>>>>
>>>>> Some functional tests have been done with Jason's
>>>>> packed ring implementation in vhost:
>>>>>
>>>>> https://lkml.org/lkml/2018/7/3/33
>>>>>
>>>>> Both of ping and netperf worked as expected.
>>>>>
>>>>> v1 -> v2:
>>>>> - Use READ_ONCE() to read event off_wrap and flags together (Jason);
>>>>> - Add comments related to ccw (Jason);
>>>>>
>>>>> RFC (v6) -> v1:
>>>>> - Avoid extra virtio_wmb() in virtqueue_enable_cb_delayed_packed()
>>>>>    when event idx is off (Jason);
>>>>> - Fix bufs calculation in virtqueue_enable_cb_delayed_packed() (Jason);
>>>>> - Test the state of the desc at used_idx instead of last_used_idx
>>>>>    in virtqueue_enable_cb_delayed_packed() (Jason);
>>>>> - Save wrap counter (as part of queue state) in the return value
>>>>>    of virtqueue_enable_cb_prepare_packed();
>>>>> - Refine the packed ring definitions in uapi;
>>>>> - Rebase on the net-next tree;
>>>>>
>>>>> RFC v5 -> RFC v6:
>>>>> - Avoid tracking addr/len/flags when DMA API isn't used (MST/Jason);
>>>>> - Define wrap counter as bool (Jason);
>>>>> - Use ALIGN() in vring_init_packed() (Jason);
>>>>> - Avoid using pointer to track `next` in detach_buf_packed() (Jason);
>>>>> - Add comments for barriers (Jason);
>>>>> - Don't enable RING_PACKED on ccw for now (noticed by Jason);
>>>>> - Refine the memory barrier in virtqueue_poll();
>>>>> - Add a missing memory barrier in virtqueue_enable_cb_delayed_packed();
>>>>> - Remove the hacks in virtqueue_enable_cb_prepare_packed();
>>>>>
>>>>> RFC v4 -> RFC v5:
>>>>> - Save DMA addr, etc in desc state (Jason);
>>>>> - Track used wrap counter;
>>>>>
>>>>> RFC v3 -> RFC v4:
>>>>> - Make ID allocation support out-of-order (Jason);
>>>>> - Various fixes for EVENT_IDX support;
>>>>>
>>>>> RFC v2 -> RFC v3:
>>>>> - Split into small patches (Jason);
>>>>> - Add helper virtqueue_use_indirect() (Jason);
>>>>> - Just set id for the last descriptor of a list (Jason);
>>>>> - Calculate the prev in virtqueue_add_packed() (Jason);
>>>>> - Fix/improve desc suppression code (Jason/MST);
>>>>> - Refine the code layout for XXX_split/packed and wrappers (MST);
>>>>> - Fix the comments and API in uapi (MST);
>>>>> - Remove the BUG_ON() for indirect (Jason);
>>>>> - Some other refinements and bug fixes;
>>>>>
>>>>> RFC v1 -> RFC v2:
>>>>> - Add indirect descriptor support - compile test only;
>>>>> - Add event suppression supprt - compile test only;
>>>>> - Move vring_packed_init() out of uapi (Jason, MST);
>>>>> - Merge two loops into one in virtqueue_add_packed() (Jason);
>>>>> - Split vring_unmap_one() for packed ring and split ring (Jason);
>>>>> - Avoid using '%' operator (Jason);
>>>>> - Rename free_head -> next_avail_idx (Jason);
>>>>> - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
>>>>> - Some other refinements and bug fixes;
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Tiwei Bie (5):
>>>>>    virtio: add packed ring definitions
>>>>>    virtio_ring: support creating packed ring
>>>>>    virtio_ring: add packed ring support
>>>>>    virtio_ring: add event idx support in packed ring
>>>>>    virtio_ring: enable packed ring
>>>>>
>>>>>   drivers/s390/virtio/virtio_ccw.c   |   14 +
>>>>>   drivers/virtio/virtio_ring.c       | 1365 ++++++++++++++++++++++------
>>>>>   include/linux/virtio_ring.h        |    8 +-
>>>>>   include/uapi/linux/virtio_config.h |    3 +
>>>>>   include/uapi/linux/virtio_ring.h   |   43 +
>>>>>   5 files changed, 1157 insertions(+), 276 deletions(-)
>>>>>
>>>>> -- 
>>>>> 2.18.0
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>>>> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>>>>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net-next 06/11] tuntap: split out XDP logic
From: Jason Wang @ 2018-09-10  3:43 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20180907101606-mutt-send-email-mst@kernel.org>



On 2018年09月07日 22:16, Michael S. Tsirkin wrote:
> On Fri, Sep 07, 2018 at 11:29:34AM +0800, Jason Wang wrote:
>>>> +		if (act != XDP_PASS)
>>>> +			goto out;
>>> likely?
>> It depends on the XDP program, so I tend not to use it.
> Point is XDP_PASS is already slow.
>

Ok.

Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net-next 04/11] tuntap: simplify error handling in tun_build_skb()
From: Jason Wang @ 2018-09-10  3:44 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20180907101645-mutt-send-email-mst@kernel.org>



On 2018年09月07日 22:17, Michael S. Tsirkin wrote:
> On Fri, Sep 07, 2018 at 11:22:00AM +0800, Jason Wang wrote:
>>>> @@ -1668,6 +1668,9 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
>>>>    	if (copied != len)
>>>>    		return ERR_PTR(-EFAULT);
>>>> +	get_page(alloc_frag->page);
>>>> +	alloc_frag->offset += buflen;
>>>> +
>>> This adds an atomic op on XDP_DROP which is a data path
>>> operation for some workloads.
>> Yes, I have patch on top to amortize this, the idea is to have a very big
>> refcount once after the frag was allocated and maintain a bias and decrease
>> them all when allocating new frags.'
> Why bother with refcounting for a drop though? It should be simple.
>

Right, let me fix this.

Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net-next 11/11] vhost_net: batch submitting XDP buffers to underlayer sockets
From: Jason Wang @ 2018-09-10  3:47 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20180907121148-mutt-send-email-mst@kernel.org>



On 2018年09月08日 00:13, Michael S. Tsirkin wrote:
> On Fri, Sep 07, 2018 at 03:41:52PM +0800, Jason Wang wrote:
>>>> @@ -556,10 +667,14 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>>>>    	size_t len, total_len = 0;
>>>>    	int err;
>>>>    	int sent_pkts = 0;
>>>> +	bool bulking = (sock->sk->sk_sndbuf == INT_MAX);
>>> What does bulking mean?
>> The name is misleading, it means whether we can do batching. For simplicity,
>> I disable batching is sndbuf is not INT_MAX.
> But what does batching have to do with sndbuf?

If we want to do batching with sndbuf, sockets needs to return the 
number of packets that was successfully sent. And vhost need to examine 
the value.

Consider performance won't be good if sndbuf is limited, I don't 
implement this for simplicity.

>
>>>>    	for (;;) {
>>>>    		bool busyloop_intr = false;
>>>> +		if (nvq->done_idx == VHOST_NET_BATCH)
>>>> +			vhost_tx_batch(net, nvq, sock, &msg);
>>>> +
>>>>    		head = get_tx_bufs(net, nvq, &msg, &out, &in, &len,
>>>>    				   &busyloop_intr);
>>>>    		/* On error, stop handling until the next kick. */
>>>> @@ -577,14 +692,34 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>>>>    			break;
>>>>    		}
>>>> -		vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head);
>>>> -		vq->heads[nvq->done_idx].len = 0;
>>>> -
>>>>    		total_len += len;
>>>> -		if (tx_can_batch(vq, total_len))
>>>> -			msg.msg_flags |= MSG_MORE;
>>>> -		else
>>>> -			msg.msg_flags &= ~MSG_MORE;
>>>> +
>>>> +		/* For simplicity, TX batching is only enabled if
>>>> +		 * sndbuf is unlimited.
>>> What if sndbuf changes while this processing is going on?
>> We will get the correct sndbuf in the next run of handle_tx(). I think this
>> is safe.
> If it's safe why bother with special-casing INT_MAX?
>

The difference is handle_tx() won't loop forever and will recognize the 
new value next time, we have a quota to limit this.

Thanks

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-09-10  6:19 UTC (permalink / raw)
  To: Jiandi An
  Cc: brijesh.singh, srikar, Michael S. Tsirkin, Benjamin Herrenschmidt,
	Will Deacon, virtualization, paulus, elfring, Anshuman Khandual,
	robh, jean-philippe.brucker, mpe, Christoph Hellwig,
	thomas.lendacky, marc.zyngier, linuxram, david, linuxppc-dev,
	linux-kernel, joe, jiandi.an, robin.murphy, haren
In-Reply-To: <f1eeb994-ea13-d0f1-dc55-7c5049e70670@amd.com>

On Thu, Sep 06, 2018 at 07:09:09PM -0500, Jiandi An wrote:
> For virtio device we have to pass in iommu_platform=true flag for
> this to set the VIRTIO_F_IOMMU_PLATFORM flag. But for example
> QEMU has the use of iommu_platform attribute disabled for virtio-gpu
> device.  So would also like to move towards not having to specify
> the VIRTIO_F_IOMMU_PLATFORM flag.

Specifying VIRTIO_F_IOMMU_PLATFORM is the right thing for your 
platform given that you can't directly use physical addresses.
Please fix qemu so that virtio-gpu works with VIRTIO_F_IOMMU_PLATFORM.

Also just as I said for the power folks: you should really work with
the qemu folks that VIRTIO_F_IOMMU_PLATFORM (or whatever we call the
properly documented flag) can be set by default, and no pointless
performance overhead is implied by having a sane and simple
implementation.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox