Linux virtualization list
 help / color / mirror / Atom feed
* Re: [RFC v4 3/5] virtio_ring: add packed ring support
From: Tiwei Bie @ 2018-05-16 13:45 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <ecb50c84-b412-ff0b-6c52-fd789b6c8a86@redhat.com>

On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
> On 2018年05月16日 20:39, Tiwei Bie wrote:
> > On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
> > > On 2018年05月16日 16:37, Tiwei Bie wrote:
> > [...]
> > > >    struct vring_virtqueue {
> > > > @@ -116,6 +117,9 @@ struct vring_virtqueue {
> > > >    			/* Last written value to driver->flags in
> > > >    			 * guest byte order. */
> > > >    			u16 event_flags_shadow;
> > > > +
> > > > +			/* ID allocation. */
> > > > +			struct idr buffer_id;
> > > I'm not sure idr is fit for the performance critical case here. Need to
> > > measure its performance impact, especially if we have few unused slots.
> > I'm also not sure.. But fortunately, it should be quite easy
> > to replace it with something else without changing other code.
> > If it will really hurt the performance, I'll change it.
> 
> We may want to do some benchmarking/profiling to see.

Yeah!

> 
> > 
> > > >    		};
> > > >    	};
> > [...]
> > > > +static void detach_buf_packed(struct vring_virtqueue *vq, unsigned int head,
> > > > +			      unsigned int id, void **ctx)
> > > > +{
> > > > +	struct vring_packed_desc *desc;
> > > > +	unsigned int i, j;
> > > > +
> > > > +	/* Clear data ptr. */
> > > > +	vq->desc_state[id].data = NULL;
> > > > +
> > > > +	i = head;
> > > > +
> > > > +	for (j = 0; j < vq->desc_state[id].num; j++) {
> > > > +		desc = &vq->vring_packed.desc[i];
> > > > +		vring_unmap_one_packed(vq, desc);
> > > As mentioned in previous discussion, this probably won't work for the case
> > > of out of order completion since it depends on the information in the
> > > descriptor ring. We probably need to extend ctx to record such information.
> > Above code doesn't depend on the information in the descriptor
> > ring. The vq->desc_state[] is the extended ctx.
> > 
> > Best regards,
> > Tiwei Bie
> 
> Yes, but desc is a pointer to descriptor ring I think so
> vring_unmap_one_packed() still depends on the content of descriptor ring?
> 

I got your point now. I think it makes sense to reserve
the bits of the addr field. Driver shouldn't try to get
addrs from the descriptors when cleanup the descriptors
no matter whether we support out-of-order or not.

But combining it with the out-of-order support, it will
mean that the driver still needs to maintain a desc/ctx
list that is very similar to the desc ring in the split
ring. I'm not quite sure whether it's something we want.
If it is true, I'll do it. So do you think we also want
to maintain such a desc/ctx list for packed ring?

Best regards,
Tiwei Bie
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC v4 3/5] virtio_ring: add packed ring support
From: Jason Wang @ 2018-05-16 14:05 UTC (permalink / raw)
  To: Tiwei Bie; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180516134550.GB4171@debian>



On 2018年05月16日 21:45, Tiwei Bie wrote:
> On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
>> On 2018年05月16日 20:39, Tiwei Bie wrote:
>>> On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
>>>> On 2018年05月16日 16:37, Tiwei Bie wrote:
>>> [...]
>>>>>     struct vring_virtqueue {
>>>>> @@ -116,6 +117,9 @@ struct vring_virtqueue {
>>>>>     			/* Last written value to driver->flags in
>>>>>     			 * guest byte order. */
>>>>>     			u16 event_flags_shadow;
>>>>> +
>>>>> +			/* ID allocation. */
>>>>> +			struct idr buffer_id;
>>>> I'm not sure idr is fit for the performance critical case here. Need to
>>>> measure its performance impact, especially if we have few unused slots.
>>> I'm also not sure.. But fortunately, it should be quite easy
>>> to replace it with something else without changing other code.
>>> If it will really hurt the performance, I'll change it.
>> We may want to do some benchmarking/profiling to see.
> Yeah!
>
>>>>>     		};
>>>>>     	};
>>> [...]
>>>>> +static void detach_buf_packed(struct vring_virtqueue *vq, unsigned int head,
>>>>> +			      unsigned int id, void **ctx)
>>>>> +{
>>>>> +	struct vring_packed_desc *desc;
>>>>> +	unsigned int i, j;
>>>>> +
>>>>> +	/* Clear data ptr. */
>>>>> +	vq->desc_state[id].data = NULL;
>>>>> +
>>>>> +	i = head;
>>>>> +
>>>>> +	for (j = 0; j < vq->desc_state[id].num; j++) {
>>>>> +		desc = &vq->vring_packed.desc[i];
>>>>> +		vring_unmap_one_packed(vq, desc);
>>>> As mentioned in previous discussion, this probably won't work for the case
>>>> of out of order completion since it depends on the information in the
>>>> descriptor ring. We probably need to extend ctx to record such information.
>>> Above code doesn't depend on the information in the descriptor
>>> ring. The vq->desc_state[] is the extended ctx.
>>>
>>> Best regards,
>>> Tiwei Bie
>> Yes, but desc is a pointer to descriptor ring I think so
>> vring_unmap_one_packed() still depends on the content of descriptor ring?
>>
> I got your point now. I think it makes sense to reserve
> the bits of the addr field. Driver shouldn't try to get
> addrs from the descriptors when cleanup the descriptors
> no matter whether we support out-of-order or not.

Maybe I was wrong, but I remember spec mentioned something like this.

>
> But combining it with the out-of-order support, it will
> mean that the driver still needs to maintain a desc/ctx
> list that is very similar to the desc ring in the split
> ring. I'm not quite sure whether it's something we want.
> If it is true, I'll do it. So do you think we also want
> to maintain such a desc/ctx list for packed ring?

To make it work for OOO backends I think we need something like this 
(hardware NIC drivers are usually have something like this).

Not for the patch, but it looks like having a OUT_OF_ORDER feature bit 
is much more simpler to be started with.

Thanks

>
> Best regards,
> Tiwei Bie

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC v4 3/5] virtio_ring: add packed ring support
From: Tiwei Bie @ 2018-05-16 14:33 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <eaf946f1-abab-c4e5-1ab5-ba7912986d58@redhat.com>

On Wed, May 16, 2018 at 10:05:44PM +0800, Jason Wang wrote:
> On 2018年05月16日 21:45, Tiwei Bie wrote:
> > On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
> > > On 2018年05月16日 20:39, Tiwei Bie wrote:
> > > > On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
> > > > > On 2018年05月16日 16:37, Tiwei Bie wrote:
[...]
> > > > > > +static void detach_buf_packed(struct vring_virtqueue *vq, unsigned int head,
> > > > > > +			      unsigned int id, void **ctx)
> > > > > > +{
> > > > > > +	struct vring_packed_desc *desc;
> > > > > > +	unsigned int i, j;
> > > > > > +
> > > > > > +	/* Clear data ptr. */
> > > > > > +	vq->desc_state[id].data = NULL;
> > > > > > +
> > > > > > +	i = head;
> > > > > > +
> > > > > > +	for (j = 0; j < vq->desc_state[id].num; j++) {
> > > > > > +		desc = &vq->vring_packed.desc[i];
> > > > > > +		vring_unmap_one_packed(vq, desc);
> > > > > As mentioned in previous discussion, this probably won't work for the case
> > > > > of out of order completion since it depends on the information in the
> > > > > descriptor ring. We probably need to extend ctx to record such information.
> > > > Above code doesn't depend on the information in the descriptor
> > > > ring. The vq->desc_state[] is the extended ctx.
> > > > 
> > > > Best regards,
> > > > Tiwei Bie
> > > Yes, but desc is a pointer to descriptor ring I think so
> > > vring_unmap_one_packed() still depends on the content of descriptor ring?
> > > 
> > I got your point now. I think it makes sense to reserve
> > the bits of the addr field. Driver shouldn't try to get
> > addrs from the descriptors when cleanup the descriptors
> > no matter whether we support out-of-order or not.
> 
> Maybe I was wrong, but I remember spec mentioned something like this.

You're right. Spec mentioned this. I was just repeating
the spec to emphasize that it does make sense. :)

> 
> > 
> > But combining it with the out-of-order support, it will
> > mean that the driver still needs to maintain a desc/ctx
> > list that is very similar to the desc ring in the split
> > ring. I'm not quite sure whether it's something we want.
> > If it is true, I'll do it. So do you think we also want
> > to maintain such a desc/ctx list for packed ring?
> 
> To make it work for OOO backends I think we need something like this
> (hardware NIC drivers are usually have something like this).

Which hardware NIC drivers have this?

> 
> Not for the patch, but it looks like having a OUT_OF_ORDER feature bit is
> much more simpler to be started with.

+1

Best regards,
Tiwei Bie
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC v4 3/5] virtio_ring: add packed ring support
From: Jason Wang @ 2018-05-17 12:01 UTC (permalink / raw)
  To: Tiwei Bie; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180516143332.GA1957@debian>



On 2018年05月16日 22:33, Tiwei Bie wrote:
> On Wed, May 16, 2018 at 10:05:44PM +0800, Jason Wang wrote:
>> On 2018年05月16日 21:45, Tiwei Bie wrote:
>>> On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
>>>> On 2018年05月16日 20:39, Tiwei Bie wrote:
>>>>> On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
>>>>>> On 2018年05月16日 16:37, Tiwei Bie wrote:
> [...]
>>>>>>> +static void detach_buf_packed(struct vring_virtqueue *vq, unsigned int head,
>>>>>>> +			      unsigned int id, void **ctx)
>>>>>>> +{
>>>>>>> +	struct vring_packed_desc *desc;
>>>>>>> +	unsigned int i, j;
>>>>>>> +
>>>>>>> +	/* Clear data ptr. */
>>>>>>> +	vq->desc_state[id].data = NULL;
>>>>>>> +
>>>>>>> +	i = head;
>>>>>>> +
>>>>>>> +	for (j = 0; j < vq->desc_state[id].num; j++) {
>>>>>>> +		desc = &vq->vring_packed.desc[i];
>>>>>>> +		vring_unmap_one_packed(vq, desc);
>>>>>> As mentioned in previous discussion, this probably won't work for the case
>>>>>> of out of order completion since it depends on the information in the
>>>>>> descriptor ring. We probably need to extend ctx to record such information.
>>>>> Above code doesn't depend on the information in the descriptor
>>>>> ring. The vq->desc_state[] is the extended ctx.
>>>>>
>>>>> Best regards,
>>>>> Tiwei Bie
>>>> Yes, but desc is a pointer to descriptor ring I think so
>>>> vring_unmap_one_packed() still depends on the content of descriptor ring?
>>>>
>>> I got your point now. I think it makes sense to reserve
>>> the bits of the addr field. Driver shouldn't try to get
>>> addrs from the descriptors when cleanup the descriptors
>>> no matter whether we support out-of-order or not.
>> Maybe I was wrong, but I remember spec mentioned something like this.
> You're right. Spec mentioned this. I was just repeating
> the spec to emphasize that it does make sense. :)
>
>>> But combining it with the out-of-order support, it will
>>> mean that the driver still needs to maintain a desc/ctx
>>> list that is very similar to the desc ring in the split
>>> ring. I'm not quite sure whether it's something we want.
>>> If it is true, I'll do it. So do you think we also want
>>> to maintain such a desc/ctx list for packed ring?
>> To make it work for OOO backends I think we need something like this
>> (hardware NIC drivers are usually have something like this).
> Which hardware NIC drivers have this?

It's quite common I think, e.g driver track e.g dma addr and page frag 
somewhere. e.g the ring->rx_info in mlx4 driver.

Thanks

>
>> Not for the patch, but it looks like having a OUT_OF_ORDER feature bit is
>> much more simpler to be started with.
> +1
>
> Best regards,
> Tiwei Bie

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH 01/24] drm/cirrus: Place GEM BOs in drm_framebuffer
From: Thierry Reding @ 2018-05-17 13:54 UTC (permalink / raw)
  To: Daniel Stone; +Cc: Dave Airlie, dri-devel, virtualization
In-Reply-To: <20180330141138.28987-1-daniels@collabora.com>


[-- Attachment #1.1: Type: text/plain, Size: 842 bytes --]

On Fri, Mar 30, 2018 at 03:11:15PM +0100, Daniel Stone wrote:
> Since drm_framebuffer can now store GEM objects directly, place them
> there rather than in our own subclass. As this makes the framebuffer
> create_handle and destroy functions the same as the GEM framebuffer
> helper, we can reuse those.
> 
> Signed-off-by: Daniel Stone <daniels@collabora.com>
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: virtualization@lists.linux-foundation.org
> ---
>  drivers/gpu/drm/cirrus/cirrus_drv.h   |  1 -
>  drivers/gpu/drm/cirrus/cirrus_fbdev.c |  8 ++++----
>  drivers/gpu/drm/cirrus/cirrus_main.c  | 25 ++++---------------------
>  drivers/gpu/drm/cirrus/cirrus_mode.c  |  4 ++--
>  4 files changed, 10 insertions(+), 28 deletions(-)

Reviewed-by: Thierry Reding <treding@nvidia.com>

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH 02/24] drm/cirrus: cirrus_framebuffer -> drm_framebuffer
From: Thierry Reding @ 2018-05-17 14:25 UTC (permalink / raw)
  To: Daniel Stone; +Cc: Dave Airlie, dri-devel, virtualization
In-Reply-To: <20180330141138.28987-2-daniels@collabora.com>


[-- Attachment #1.1: Type: text/plain, Size: 713 bytes --]

On Fri, Mar 30, 2018 at 03:11:16PM +0100, Daniel Stone wrote:
> Now cirrus_framebuffer is just an empty wrapper around drm_framebuffer,
> we can drop it.
> 
> Signed-off-by: Daniel Stone <daniels@collabora.com>
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: virtualization@lists.linux-foundation.org
> ---
>  drivers/gpu/drm/cirrus/cirrus_drv.h   |  9 ++-------
>  drivers/gpu/drm/cirrus/cirrus_fbdev.c | 20 ++++++++++----------
>  drivers/gpu/drm/cirrus/cirrus_main.c  | 20 ++++++++++----------
>  drivers/gpu/drm/cirrus/cirrus_mode.c  | 12 +++---------
>  4 files changed, 25 insertions(+), 36 deletions(-)

Reviewed-by: Thierry Reding <treding@nvidia.com>

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH 03/24] drm/virtio: Place GEM BOs in drm_framebuffer
From: Thierry Reding @ 2018-05-17 14:29 UTC (permalink / raw)
  To: Daniel Stone; +Cc: Dave Airlie, dri-devel, virtualization
In-Reply-To: <20180330141138.28987-3-daniels@collabora.com>


[-- Attachment #1.1: Type: text/plain, Size: 859 bytes --]

On Fri, Mar 30, 2018 at 03:11:17PM +0100, Daniel Stone wrote:
> Since drm_framebuffer can now store GEM objects directly, place them
> there rather than in our own subclass. As this makes the framebuffer
> create_handle and destroy functions the same as the GEM framebuffer
> helper, we can reuse those.
> 
> Signed-off-by: Daniel Stone <daniels@collabora.com>
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: virtualization@lists.linux-foundation.org
> ---
>  drivers/gpu/drm/virtio/virtgpu_display.c | 30 +++++-------------------------
>  drivers/gpu/drm/virtio/virtgpu_drv.h     |  1 -
>  drivers/gpu/drm/virtio/virtgpu_fb.c      |  8 ++++----
>  drivers/gpu/drm/virtio/virtgpu_plane.c   |  4 ++--
>  4 files changed, 11 insertions(+), 32 deletions(-)

Reviewed-by: Thierry Reding <treding@nvidia.com>

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* RE: [PATCH 0/6] Macrofying inline assembly for better compilation
From: David Laight @ 2018-05-18  9:20 UTC (permalink / raw)
  To: 'Nadav Amit', linux-kernel@vger.kernel.org,
	x86@kernel.org
  Cc: Juergen Gross, Randy Dunlap, Kees Cook, Jonathan Corbet,
	Peter Zijlstra, Christopher Li, Josh Poimboeuf,
	virtualization@lists.linux-foundation.org,
	linux-sparse@vger.kernel.org, Ingo Molnar, Jan Beulich,
	H. Peter Anvin, Alok Kataria, nadav.amit@gmail.com,
	Thomas Gleixner
In-Reply-To: <20180517161402.78089-1-namit@vmware.com>

From: Nadav Amit
> Sent: 17 May 2018 17:14
> This patch-set deals with an interesting yet stupid problem: kernel code
> that does not get inlined despite its simplicity. There are several
> causes for this behavior: "cold" attribute on __init, different function
> optimization levels; conditional constant computations based on
> __builtin_constant_p(); and finally large inline assembly blocks.
> 
> This patch-set deals with the inline assembly problem. I separated these
> patches from the others (that were sent in the RFC) for easier
> inclusion.
> 
> The problem with inline assembly is that inline assembly is often used
> by the kernel for things that are other than code - for example,
> assembly directives and data. GCC however is oblivious to the content of
> the blocks and assumes their cost in space and time is proportional to
> the number of the perceived assembly "instruction", according to the
> number of newlines and semicolons. Alternatives, paravirt and other
> mechanisms are affected, causing code not to be inlined, and degrading
> compilation quality in general.
> 
> The solution that this patch-set carries for this problem is to create
> an assembly macro, and then call it from the inline assembly block.  As
> a result, the compiler sees a single "instruction" and assigns the more
> appropriate cost to the code. In addition, this patch-set removes
> unneeded new-lines from common x86 inline asm's, which "confuse" GCC
> heuristics.

Can't you get the same effect by using always_inline ?

	David

^ permalink raw reply

* Re: KASAN: use-after-free Read in vhost_chr_write_iter
From: Jason Wang @ 2018-05-18  9:24 UTC (permalink / raw)
  To: DaeRyong Jeong, mst
  Cc: bammanag, kt0755, kvm, netdev, linux-kernel, virtualization,
	byoungyoung
In-Reply-To: <20180517134544.GA20646@dragonet.kaist.ac.kr>



On 2018年05月17日 21:45, DaeRyong Jeong wrote:
> We report the crash: KASAN: use-after-free Read in vhost_chr_write_iter
>
> This crash has been found in v4.17-rc1 using RaceFuzzer (a modified
> version of Syzkaller), which we describe more at the end of this
> report. Our analysis shows that the race occurs when invoking two
> syscalls concurrently, write$vnet and ioctl$VHOST_RESET_OWNER.
>
>
> Analysis:
> We think the concurrent execution of vhost_process_iotlb_msg() and
> vhost_dev_cleanup() causes the crash.
> Both of functions can run concurrently (please see call sequence below),
> and possibly, there is a race on dev->iotlb.
> If the switch occurs right after vhost_dev_cleanup() frees
> dev->iotlb, vhost_process_iotlb_msg() still sees the non-null value and it
> keep executing without returning -EFAULT. Consequently, use-after-free
> occures
>
>
> Thread interleaving:
> CPU0 (vhost_process_iotlb_msg)				CPU1 (vhost_dev_cleanup)
> (In the case of both VHOST_IOTLB_UPDATE and
> VHOST_IOTLB_INVALIDATE)
> =====							=====
> 							vhost_umem_clean(dev->iotlb);
> if (!dev->iotlb) {
> 	        ret = -EFAULT;
> 		        break;
> }
> 							dev->iotlb = NULL;
>
>
> Call Sequence:
> CPU0
> =====
> vhost_net_chr_write_iter
> 	vhost_chr_write_iter
> 		vhost_process_iotlb_msg
>
> CPU1
> =====
> vhost_net_ioctl
> 	vhost_net_reset_owner
> 		vhost_dev_reset_owner
> 			vhost_dev_cleanup

Thanks a lot for the analysis.

This could be addressed by simply protect it with dev mutex.

Will post a patch.

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC v4 3/5] virtio_ring: add packed ring support
From: Tiwei Bie @ 2018-05-18 11:29 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <bc38e5a1-e920-7055-dc22-49ac98455257@redhat.com>

On Thu, May 17, 2018 at 08:01:52PM +0800, Jason Wang wrote:
> On 2018年05月16日 22:33, Tiwei Bie wrote:
> > On Wed, May 16, 2018 at 10:05:44PM +0800, Jason Wang wrote:
> > > On 2018年05月16日 21:45, Tiwei Bie wrote:
> > > > On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
> > > > > On 2018年05月16日 20:39, Tiwei Bie wrote:
> > > > > > On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
> > > > > > > On 2018年05月16日 16:37, Tiwei Bie wrote:
> > [...]
> > > > > > > > +static void detach_buf_packed(struct vring_virtqueue *vq, unsigned int head,
> > > > > > > > +			      unsigned int id, void **ctx)
> > > > > > > > +{
> > > > > > > > +	struct vring_packed_desc *desc;
> > > > > > > > +	unsigned int i, j;
> > > > > > > > +
> > > > > > > > +	/* Clear data ptr. */
> > > > > > > > +	vq->desc_state[id].data = NULL;
> > > > > > > > +
> > > > > > > > +	i = head;
> > > > > > > > +
> > > > > > > > +	for (j = 0; j < vq->desc_state[id].num; j++) {
> > > > > > > > +		desc = &vq->vring_packed.desc[i];
> > > > > > > > +		vring_unmap_one_packed(vq, desc);
> > > > > > > As mentioned in previous discussion, this probably won't work for the case
> > > > > > > of out of order completion since it depends on the information in the
> > > > > > > descriptor ring. We probably need to extend ctx to record such information.
> > > > > > Above code doesn't depend on the information in the descriptor
> > > > > > ring. The vq->desc_state[] is the extended ctx.
> > > > > > 
> > > > > > Best regards,
> > > > > > Tiwei Bie
> > > > > Yes, but desc is a pointer to descriptor ring I think so
> > > > > vring_unmap_one_packed() still depends on the content of descriptor ring?
> > > > > 
> > > > I got your point now. I think it makes sense to reserve
> > > > the bits of the addr field. Driver shouldn't try to get
> > > > addrs from the descriptors when cleanup the descriptors
> > > > no matter whether we support out-of-order or not.
> > > Maybe I was wrong, but I remember spec mentioned something like this.
> > You're right. Spec mentioned this. I was just repeating
> > the spec to emphasize that it does make sense. :)
> > 
> > > > But combining it with the out-of-order support, it will
> > > > mean that the driver still needs to maintain a desc/ctx
> > > > list that is very similar to the desc ring in the split
> > > > ring. I'm not quite sure whether it's something we want.
> > > > If it is true, I'll do it. So do you think we also want
> > > > to maintain such a desc/ctx list for packed ring?
> > > To make it work for OOO backends I think we need something like this
> > > (hardware NIC drivers are usually have something like this).
> > Which hardware NIC drivers have this?
> 
> It's quite common I think, e.g driver track e.g dma addr and page frag
> somewhere. e.g the ring->rx_info in mlx4 driver.

It seems that I had a misunderstanding on your
previous comments. I know it's quite common for
drivers to track e.g. DMA addrs somewhere (and
I think one reason behind this is that they want
to reuse the bits of addr field). But tracking
addrs somewhere doesn't means supporting OOO.
I thought you were saying it's quite common for
hardware NIC drivers to support OOO (i.e. NICs
will return the descriptors OOO):

I'm not familiar with mlx4, maybe I'm wrong.
I just had a quick glance. And I found below
comments in mlx4_en_process_rx_cq():

```
/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
 * descriptor offset can be deduced from the CQE index instead of
 * reading 'cqe->index' */
index = cq->mcq.cons_index & ring->size_mask;
cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
```

It seems that although they have a completion
queue, they are still using the ring in order.

I guess maybe storage device may want OOO.

Best regards,
Tiwei Bie

> 
> Thanks
> 
> > 
> > > Not for the patch, but it looks like having a OUT_OF_ORDER feature bit is
> > > much more simpler to be started with.
> > +1
> > 
> > Best regards,
> > Tiwei Bie
> 
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC v4 3/5] virtio_ring: add packed ring support
From: Jason Wang @ 2018-05-18 13:17 UTC (permalink / raw)
  To: Tiwei Bie; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180518112950.GA28224@debian>



On 2018年05月18日 19:29, Tiwei Bie wrote:
> On Thu, May 17, 2018 at 08:01:52PM +0800, Jason Wang wrote:
>> On 2018年05月16日 22:33, Tiwei Bie wrote:
>>> On Wed, May 16, 2018 at 10:05:44PM +0800, Jason Wang wrote:
>>>> On 2018年05月16日 21:45, Tiwei Bie wrote:
>>>>> On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
>>>>>> On 2018年05月16日 20:39, Tiwei Bie wrote:
>>>>>>> On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
>>>>>>>> On 2018年05月16日 16:37, Tiwei Bie wrote:
>>> [...]
>>>>>>>>> +static void detach_buf_packed(struct vring_virtqueue *vq, unsigned int head,
>>>>>>>>> +			      unsigned int id, void **ctx)
>>>>>>>>> +{
>>>>>>>>> +	struct vring_packed_desc *desc;
>>>>>>>>> +	unsigned int i, j;
>>>>>>>>> +
>>>>>>>>> +	/* Clear data ptr. */
>>>>>>>>> +	vq->desc_state[id].data = NULL;
>>>>>>>>> +
>>>>>>>>> +	i = head;
>>>>>>>>> +
>>>>>>>>> +	for (j = 0; j < vq->desc_state[id].num; j++) {
>>>>>>>>> +		desc = &vq->vring_packed.desc[i];
>>>>>>>>> +		vring_unmap_one_packed(vq, desc);
>>>>>>>> As mentioned in previous discussion, this probably won't work for the case
>>>>>>>> of out of order completion since it depends on the information in the
>>>>>>>> descriptor ring. We probably need to extend ctx to record such information.
>>>>>>> Above code doesn't depend on the information in the descriptor
>>>>>>> ring. The vq->desc_state[] is the extended ctx.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Tiwei Bie
>>>>>> Yes, but desc is a pointer to descriptor ring I think so
>>>>>> vring_unmap_one_packed() still depends on the content of descriptor ring?
>>>>>>
>>>>> I got your point now. I think it makes sense to reserve
>>>>> the bits of the addr field. Driver shouldn't try to get
>>>>> addrs from the descriptors when cleanup the descriptors
>>>>> no matter whether we support out-of-order or not.
>>>> Maybe I was wrong, but I remember spec mentioned something like this.
>>> You're right. Spec mentioned this. I was just repeating
>>> the spec to emphasize that it does make sense. :)
>>>
>>>>> But combining it with the out-of-order support, it will
>>>>> mean that the driver still needs to maintain a desc/ctx
>>>>> list that is very similar to the desc ring in the split
>>>>> ring. I'm not quite sure whether it's something we want.
>>>>> If it is true, I'll do it. So do you think we also want
>>>>> to maintain such a desc/ctx list for packed ring?
>>>> To make it work for OOO backends I think we need something like this
>>>> (hardware NIC drivers are usually have something like this).
>>> Which hardware NIC drivers have this?
>> It's quite common I think, e.g driver track e.g dma addr and page frag
>> somewhere. e.g the ring->rx_info in mlx4 driver.
> It seems that I had a misunderstanding on your
> previous comments. I know it's quite common for
> drivers to track e.g. DMA addrs somewhere (and
> I think one reason behind this is that they want
> to reuse the bits of addr field).

Yes, we may want this for virtio-net as well in the future.

>   But tracking
> addrs somewhere doesn't means supporting OOO.
> I thought you were saying it's quite common for
> hardware NIC drivers to support OOO (i.e. NICs
> will return the descriptors OOO):
>
> I'm not familiar with mlx4, maybe I'm wrong.
> I just had a quick glance. And I found below
> comments in mlx4_en_process_rx_cq():
>
> ```
> /* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
>   * descriptor offset can be deduced from the CQE index instead of
>   * reading 'cqe->index' */
> index = cq->mcq.cons_index & ring->size_mask;
> cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
> ```
>
> It seems that although they have a completion
> queue, they are still using the ring in order.

I guess so (at least from the above bits). Git grep -i "out of order" in 
drivers/net gives some hints. Looks like there're few deivces do this.

> I guess maybe storage device may want OOO.

Right, some iSCSI did.

But tracking them elsewhere is not only for OOO.

Spec said:

for element address

"
In a used descriptor, Element Address is unused.
"

for Next flag:

"
For example, if descriptors are used in the same order in which they are 
made available, this will result in
the used descriptor overwriting the first available descriptor in the 
list, the used descriptor for the next list
overwriting the first available descriptor in the next list, etc.
"

for in order completion:

"
This will result in the used descriptor overwriting the first available 
descriptor in the batch, the used descriptor
for the next batch overwriting the first available descriptor in the 
next batch, etc.
"

So:

- It's an alignment to the spec
- device may (or should) overwrite the descriptor make also make address 
field useless.

Thanks

>
> Best regards,
> Tiwei Bie
>
>> Thanks
>>
>>>> Not for the patch, but it looks like having a OUT_OF_ORDER feature bit is
>>>> much more simpler to be started with.
>>> +1
>>>
>>> Best regards,
>>> Tiwei Bie

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC v4 3/5] virtio_ring: add packed ring support
From: Tiwei Bie @ 2018-05-18 14:33 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <e247b46a-183f-3959-859b-aebb0bf81c07@redhat.com>

On Fri, May 18, 2018 at 09:17:05PM +0800, Jason Wang wrote:
> On 2018年05月18日 19:29, Tiwei Bie wrote:
> > On Thu, May 17, 2018 at 08:01:52PM +0800, Jason Wang wrote:
> > > On 2018年05月16日 22:33, Tiwei Bie wrote:
> > > > On Wed, May 16, 2018 at 10:05:44PM +0800, Jason Wang wrote:
> > > > > On 2018年05月16日 21:45, Tiwei Bie wrote:
> > > > > > On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
> > > > > > > On 2018年05月16日 20:39, Tiwei Bie wrote:
> > > > > > > > On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
> > > > > > > > > On 2018年05月16日 16:37, Tiwei Bie wrote:
> > > > [...]
> > > > > > > > > > +static void detach_buf_packed(struct vring_virtqueue *vq, unsigned int head,
> > > > > > > > > > +			      unsigned int id, void **ctx)
> > > > > > > > > > +{
> > > > > > > > > > +	struct vring_packed_desc *desc;
> > > > > > > > > > +	unsigned int i, j;
> > > > > > > > > > +
> > > > > > > > > > +	/* Clear data ptr. */
> > > > > > > > > > +	vq->desc_state[id].data = NULL;
> > > > > > > > > > +
> > > > > > > > > > +	i = head;
> > > > > > > > > > +
> > > > > > > > > > +	for (j = 0; j < vq->desc_state[id].num; j++) {
> > > > > > > > > > +		desc = &vq->vring_packed.desc[i];
> > > > > > > > > > +		vring_unmap_one_packed(vq, desc);
> > > > > > > > > As mentioned in previous discussion, this probably won't work for the case
> > > > > > > > > of out of order completion since it depends on the information in the
> > > > > > > > > descriptor ring. We probably need to extend ctx to record such information.
> > > > > > > > Above code doesn't depend on the information in the descriptor
> > > > > > > > ring. The vq->desc_state[] is the extended ctx.
> > > > > > > > 
> > > > > > > > Best regards,
> > > > > > > > Tiwei Bie
> > > > > > > Yes, but desc is a pointer to descriptor ring I think so
> > > > > > > vring_unmap_one_packed() still depends on the content of descriptor ring?
> > > > > > > 
> > > > > > I got your point now. I think it makes sense to reserve
> > > > > > the bits of the addr field. Driver shouldn't try to get
> > > > > > addrs from the descriptors when cleanup the descriptors
> > > > > > no matter whether we support out-of-order or not.
> > > > > Maybe I was wrong, but I remember spec mentioned something like this.
> > > > You're right. Spec mentioned this. I was just repeating
> > > > the spec to emphasize that it does make sense. :)
> > > > 
> > > > > > But combining it with the out-of-order support, it will
> > > > > > mean that the driver still needs to maintain a desc/ctx
> > > > > > list that is very similar to the desc ring in the split
> > > > > > ring. I'm not quite sure whether it's something we want.
> > > > > > If it is true, I'll do it. So do you think we also want
> > > > > > to maintain such a desc/ctx list for packed ring?
> > > > > To make it work for OOO backends I think we need something like this
> > > > > (hardware NIC drivers are usually have something like this).
> > > > Which hardware NIC drivers have this?
> > > It's quite common I think, e.g driver track e.g dma addr and page frag
> > > somewhere. e.g the ring->rx_info in mlx4 driver.
> > It seems that I had a misunderstanding on your
> > previous comments. I know it's quite common for
> > drivers to track e.g. DMA addrs somewhere (and
> > I think one reason behind this is that they want
> > to reuse the bits of addr field).
> 
> Yes, we may want this for virtio-net as well in the future.
> 
> >   But tracking
> > addrs somewhere doesn't means supporting OOO.
> > I thought you were saying it's quite common for
> > hardware NIC drivers to support OOO (i.e. NICs
> > will return the descriptors OOO):
> > 
> > I'm not familiar with mlx4, maybe I'm wrong.
> > I just had a quick glance. And I found below
> > comments in mlx4_en_process_rx_cq():
> > 
> > ```
> > /* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
> >   * descriptor offset can be deduced from the CQE index instead of
> >   * reading 'cqe->index' */
> > index = cq->mcq.cons_index & ring->size_mask;
> > cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
> > ```
> > 
> > It seems that although they have a completion
> > queue, they are still using the ring in order.
> 
> I guess so (at least from the above bits). Git grep -i "out of order" in
> drivers/net gives some hints. Looks like there're few deivces do this.
> 
> > I guess maybe storage device may want OOO.
> 
> Right, some iSCSI did.
> 
> But tracking them elsewhere is not only for OOO.
> 
> Spec said:
> 
> for element address
> 
> "
> In a used descriptor, Element Address is unused.
> "
> 
> for Next flag:
> 
> "
> For example, if descriptors are used in the same order in which they are
> made available, this will result in
> the used descriptor overwriting the first available descriptor in the list,
> the used descriptor for the next list
> overwriting the first available descriptor in the next list, etc.
> "
> 
> for in order completion:
> 
> "
> This will result in the used descriptor overwriting the first available
> descriptor in the batch, the used descriptor
> for the next batch overwriting the first available descriptor in the next
> batch, etc.
> "
> 
> So:
> 
> - It's an alignment to the spec
> - device may (or should) overwrite the descriptor make also make address
> field useless.

You didn't get my point...

I agreed driver should track the DMA addrs or some
other necessary things from the very beginning. And
I also repeated the spec to emphasize that it does
make sense. And I'd like to do that.

What I was saying is that, to support OOO, we may
need to manage these context (which saves DMA addrs
etc) via a list which is similar to the desc list
maintained via `next` in split ring instead of an
array whose elements always can be indexed directly.

The desc ring in split ring is an array, but its
free entries are managed as list via next. I was
just wondering, do we want to manage such a list
because of OOO. It's just a very simple question
that I want to hear your opinion... (It doesn't
means anything, e.g. It doesn't mean I don't want
to support OOO. It's just a simple question...)

Best regards,
Tiwei Bie

> 
> Thanks
> 
> > 
> > Best regards,
> > Tiwei Bie
> > 
> > > Thanks
> > > 
> > > > > Not for the patch, but it looks like having a OUT_OF_ORDER feature bit is
> > > > > much more simpler to be started with.
> > > > +1
> > > > 
> > > > Best regards,
> > > > Tiwei Bie
> 
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC v4 3/5] virtio_ring: add packed ring support
From: Jason Wang @ 2018-05-19  1:12 UTC (permalink / raw)
  To: Tiwei Bie; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180518143334.GA4537@debian>



On 2018年05月18日 22:33, Tiwei Bie wrote:
> On Fri, May 18, 2018 at 09:17:05PM +0800, Jason Wang wrote:
>> On 2018年05月18日 19:29, Tiwei Bie wrote:
>>> On Thu, May 17, 2018 at 08:01:52PM +0800, Jason Wang wrote:
>>>> On 2018年05月16日 22:33, Tiwei Bie wrote:
>>>>> On Wed, May 16, 2018 at 10:05:44PM +0800, Jason Wang wrote:
>>>>>> On 2018年05月16日 21:45, Tiwei Bie wrote:
>>>>>>> On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
>>>>>>>> On 2018年05月16日 20:39, Tiwei Bie wrote:
>>>>>>>>> On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
>>>>>>>>>> On 2018年05月16日 16:37, Tiwei Bie wrote:
>>>>> [...]
>>>>>>>>>>> +static void detach_buf_packed(struct vring_virtqueue *vq, unsigned int head,
>>>>>>>>>>> +			      unsigned int id, void **ctx)
>>>>>>>>>>> +{
>>>>>>>>>>> +	struct vring_packed_desc *desc;
>>>>>>>>>>> +	unsigned int i, j;
>>>>>>>>>>> +
>>>>>>>>>>> +	/* Clear data ptr. */
>>>>>>>>>>> +	vq->desc_state[id].data = NULL;
>>>>>>>>>>> +
>>>>>>>>>>> +	i = head;
>>>>>>>>>>> +
>>>>>>>>>>> +	for (j = 0; j < vq->desc_state[id].num; j++) {
>>>>>>>>>>> +		desc = &vq->vring_packed.desc[i];
>>>>>>>>>>> +		vring_unmap_one_packed(vq, desc);
>>>>>>>>>> As mentioned in previous discussion, this probably won't work for the case
>>>>>>>>>> of out of order completion since it depends on the information in the
>>>>>>>>>> descriptor ring. We probably need to extend ctx to record such information.
>>>>>>>>> Above code doesn't depend on the information in the descriptor
>>>>>>>>> ring. The vq->desc_state[] is the extended ctx.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Tiwei Bie
>>>>>>>> Yes, but desc is a pointer to descriptor ring I think so
>>>>>>>> vring_unmap_one_packed() still depends on the content of descriptor ring?
>>>>>>>>
>>>>>>> I got your point now. I think it makes sense to reserve
>>>>>>> the bits of the addr field. Driver shouldn't try to get
>>>>>>> addrs from the descriptors when cleanup the descriptors
>>>>>>> no matter whether we support out-of-order or not.
>>>>>> Maybe I was wrong, but I remember spec mentioned something like this.
>>>>> You're right. Spec mentioned this. I was just repeating
>>>>> the spec to emphasize that it does make sense. :)
>>>>>
>>>>>>> But combining it with the out-of-order support, it will
>>>>>>> mean that the driver still needs to maintain a desc/ctx
>>>>>>> list that is very similar to the desc ring in the split
>>>>>>> ring. I'm not quite sure whether it's something we want.
>>>>>>> If it is true, I'll do it. So do you think we also want
>>>>>>> to maintain such a desc/ctx list for packed ring?
>>>>>> To make it work for OOO backends I think we need something like this
>>>>>> (hardware NIC drivers are usually have something like this).
>>>>> Which hardware NIC drivers have this?
>>>> It's quite common I think, e.g driver track e.g dma addr and page frag
>>>> somewhere. e.g the ring->rx_info in mlx4 driver.
>>> It seems that I had a misunderstanding on your
>>> previous comments. I know it's quite common for
>>> drivers to track e.g. DMA addrs somewhere (and
>>> I think one reason behind this is that they want
>>> to reuse the bits of addr field).
>> Yes, we may want this for virtio-net as well in the future.
>>
>>>    But tracking
>>> addrs somewhere doesn't means supporting OOO.
>>> I thought you were saying it's quite common for
>>> hardware NIC drivers to support OOO (i.e. NICs
>>> will return the descriptors OOO):
>>>
>>> I'm not familiar with mlx4, maybe I'm wrong.
>>> I just had a quick glance. And I found below
>>> comments in mlx4_en_process_rx_cq():
>>>
>>> ```
>>> /* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
>>>    * descriptor offset can be deduced from the CQE index instead of
>>>    * reading 'cqe->index' */
>>> index = cq->mcq.cons_index & ring->size_mask;
>>> cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
>>> ```
>>>
>>> It seems that although they have a completion
>>> queue, they are still using the ring in order.
>> I guess so (at least from the above bits). Git grep -i "out of order" in
>> drivers/net gives some hints. Looks like there're few deivces do this.
>>
>>> I guess maybe storage device may want OOO.
>> Right, some iSCSI did.
>>
>> But tracking them elsewhere is not only for OOO.
>>
>> Spec said:
>>
>> for element address
>>
>> "
>> In a used descriptor, Element Address is unused.
>> "
>>
>> for Next flag:
>>
>> "
>> For example, if descriptors are used in the same order in which they are
>> made available, this will result in
>> the used descriptor overwriting the first available descriptor in the list,
>> the used descriptor for the next list
>> overwriting the first available descriptor in the next list, etc.
>> "
>>
>> for in order completion:
>>
>> "
>> This will result in the used descriptor overwriting the first available
>> descriptor in the batch, the used descriptor
>> for the next batch overwriting the first available descriptor in the next
>> batch, etc.
>> "
>>
>> So:
>>
>> - It's an alignment to the spec
>> - device may (or should) overwrite the descriptor make also make address
>> field useless.
> You didn't get my point...

I don't hope so.

> I agreed driver should track the DMA addrs or some
> other necessary things from the very beginning. And
> I also repeated the spec to emphasize that it does
> make sense. And I'd like to do that.
>
> What I was saying is that, to support OOO, we may
> need to manage these context (which saves DMA addrs
> etc) via a list which is similar to the desc list
> maintained via `next` in split ring instead of an
> array whose elements always can be indexed directly.

My point is these context is a must (not only for OOO).

>
> The desc ring in split ring is an array, but its
> free entries are managed as list via next. I was
> just wondering, do we want to manage such a list
> because of OOO. It's just a very simple question
> that I want to hear your opinion... (It doesn't
> means anything, e.g. It doesn't mean I don't want
> to support OOO. It's just a simple question...)

So the question is yes. But I admit I don't have better idea other than 
what you propose here (something like split ring which is a little bit 
sad). Maybe Michael had.

Thanks

>
> Best regards,
> Tiwei Bie
>
>> Thanks
>>
>>> Best regards,
>>> Tiwei Bie
>>>
>>>> Thanks
>>>>
>>>>>> Not for the patch, but it looks like having a OUT_OF_ORDER feature bit is
>>>>>> much more simpler to be started with.
>>>>> +1
>>>>>
>>>>> Best regards,
>>>>> Tiwei Bie

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC v4 3/5] virtio_ring: add packed ring support
From: Tiwei Bie @ 2018-05-19  2:29 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <1a661df0-8ca9-b31d-9c17-8684d608a33a@redhat.com>

On Sat, May 19, 2018 at 09:12:30AM +0800, Jason Wang wrote:
> On 2018年05月18日 22:33, Tiwei Bie wrote:
> > On Fri, May 18, 2018 at 09:17:05PM +0800, Jason Wang wrote:
> > > On 2018年05月18日 19:29, Tiwei Bie wrote:
> > > > On Thu, May 17, 2018 at 08:01:52PM +0800, Jason Wang wrote:
> > > > > On 2018年05月16日 22:33, Tiwei Bie wrote:
> > > > > > On Wed, May 16, 2018 at 10:05:44PM +0800, Jason Wang wrote:
> > > > > > > On 2018年05月16日 21:45, Tiwei Bie wrote:
> > > > > > > > On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
> > > > > > > > > On 2018年05月16日 20:39, Tiwei Bie wrote:
> > > > > > > > > > On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
> > > > > > > > > > > On 2018年05月16日 16:37, Tiwei Bie wrote:
> > > > > > [...]
> > > > > > > > > > > > +static void detach_buf_packed(struct vring_virtqueue *vq, unsigned int head,
> > > > > > > > > > > > +			      unsigned int id, void **ctx)
> > > > > > > > > > > > +{
> > > > > > > > > > > > +	struct vring_packed_desc *desc;
> > > > > > > > > > > > +	unsigned int i, j;
> > > > > > > > > > > > +
> > > > > > > > > > > > +	/* Clear data ptr. */
> > > > > > > > > > > > +	vq->desc_state[id].data = NULL;
> > > > > > > > > > > > +
> > > > > > > > > > > > +	i = head;
> > > > > > > > > > > > +
> > > > > > > > > > > > +	for (j = 0; j < vq->desc_state[id].num; j++) {
> > > > > > > > > > > > +		desc = &vq->vring_packed.desc[i];
> > > > > > > > > > > > +		vring_unmap_one_packed(vq, desc);
> > > > > > > > > > > As mentioned in previous discussion, this probably won't work for the case
> > > > > > > > > > > of out of order completion since it depends on the information in the
> > > > > > > > > > > descriptor ring. We probably need to extend ctx to record such information.
> > > > > > > > > > Above code doesn't depend on the information in the descriptor
> > > > > > > > > > ring. The vq->desc_state[] is the extended ctx.
> > > > > > > > > > 
> > > > > > > > > > Best regards,
> > > > > > > > > > Tiwei Bie
> > > > > > > > > Yes, but desc is a pointer to descriptor ring I think so
> > > > > > > > > vring_unmap_one_packed() still depends on the content of descriptor ring?
> > > > > > > > > 
> > > > > > > > I got your point now. I think it makes sense to reserve
> > > > > > > > the bits of the addr field. Driver shouldn't try to get
> > > > > > > > addrs from the descriptors when cleanup the descriptors
> > > > > > > > no matter whether we support out-of-order or not.
> > > > > > > Maybe I was wrong, but I remember spec mentioned something like this.
> > > > > > You're right. Spec mentioned this. I was just repeating
> > > > > > the spec to emphasize that it does make sense. :)
> > > > > > 
> > > > > > > > But combining it with the out-of-order support, it will
> > > > > > > > mean that the driver still needs to maintain a desc/ctx
> > > > > > > > list that is very similar to the desc ring in the split
> > > > > > > > ring. I'm not quite sure whether it's something we want.
> > > > > > > > If it is true, I'll do it. So do you think we also want
> > > > > > > > to maintain such a desc/ctx list for packed ring?
> > > > > > > To make it work for OOO backends I think we need something like this
> > > > > > > (hardware NIC drivers are usually have something like this).
> > > > > > Which hardware NIC drivers have this?
> > > > > It's quite common I think, e.g driver track e.g dma addr and page frag
> > > > > somewhere. e.g the ring->rx_info in mlx4 driver.
> > > > It seems that I had a misunderstanding on your
> > > > previous comments. I know it's quite common for
> > > > drivers to track e.g. DMA addrs somewhere (and
> > > > I think one reason behind this is that they want
> > > > to reuse the bits of addr field).
> > > Yes, we may want this for virtio-net as well in the future.
> > > 
> > > >    But tracking
> > > > addrs somewhere doesn't means supporting OOO.
> > > > I thought you were saying it's quite common for
> > > > hardware NIC drivers to support OOO (i.e. NICs
> > > > will return the descriptors OOO):
> > > > 
> > > > I'm not familiar with mlx4, maybe I'm wrong.
> > > > I just had a quick glance. And I found below
> > > > comments in mlx4_en_process_rx_cq():
> > > > 
> > > > ```
> > > > /* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
> > > >    * descriptor offset can be deduced from the CQE index instead of
> > > >    * reading 'cqe->index' */
> > > > index = cq->mcq.cons_index & ring->size_mask;
> > > > cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
> > > > ```
> > > > 
> > > > It seems that although they have a completion
> > > > queue, they are still using the ring in order.
> > > I guess so (at least from the above bits). Git grep -i "out of order" in
> > > drivers/net gives some hints. Looks like there're few deivces do this.
> > > 
> > > > I guess maybe storage device may want OOO.
> > > Right, some iSCSI did.
> > > 
> > > But tracking them elsewhere is not only for OOO.
> > > 
> > > Spec said:
> > > 
> > > for element address
> > > 
> > > "
> > > In a used descriptor, Element Address is unused.
> > > "
> > > 
> > > for Next flag:
> > > 
> > > "
> > > For example, if descriptors are used in the same order in which they are
> > > made available, this will result in
> > > the used descriptor overwriting the first available descriptor in the list,
> > > the used descriptor for the next list
> > > overwriting the first available descriptor in the next list, etc.
> > > "
> > > 
> > > for in order completion:
> > > 
> > > "
> > > This will result in the used descriptor overwriting the first available
> > > descriptor in the batch, the used descriptor
> > > for the next batch overwriting the first available descriptor in the next
> > > batch, etc.
> > > "
> > > 
> > > So:
> > > 
> > > - It's an alignment to the spec
> > > - device may (or should) overwrite the descriptor make also make address
> > > field useless.
> > You didn't get my point...
> 
> I don't hope so.
> 
> > I agreed driver should track the DMA addrs or some
> > other necessary things from the very beginning. And
> > I also repeated the spec to emphasize that it does
> > make sense. And I'd like to do that.
> > 
> > What I was saying is that, to support OOO, we may
> > need to manage these context (which saves DMA addrs
> > etc) via a list which is similar to the desc list
> > maintained via `next` in split ring instead of an
> > array whose elements always can be indexed directly.
> 
> My point is these context is a must (not only for OOO).

Yeah, and I have the exactly same point after you
pointed that I shouldn't get the addrs from descs.
I do think it makes sense. I'll do it in the next
version. I don't have any doubt about it. All my
questions are about the OOO, instead of whether we
should save context or not. It just seems that you
thought I don't want to do it, and were trying to
convince me that I should do it.

> 
> > 
> > The desc ring in split ring is an array, but its
> > free entries are managed as list via next. I was
> > just wondering, do we want to manage such a list
> > because of OOO. It's just a very simple question
> > that I want to hear your opinion... (It doesn't
> > means anything, e.g. It doesn't mean I don't want
> > to support OOO. It's just a simple question...)
> 
> So the question is yes. But I admit I don't have better idea other than what
> you propose here (something like split ring which is a little bit sad).
> Maybe Michael had.

Yeah, that's why I asked this question. It will
make the packed ring a bit similar to split ring
at least in the driver part. So I want to draw
your attention on this to make sure that we're
on the same page.

Best regards,
Tiwei Bie

> 
> Thanks
> 
> > 
> > Best regards,
> > Tiwei Bie
> > 
> > > Thanks
> > > 
> > > > Best regards,
> > > > Tiwei Bie
> > > > 
> > > > > Thanks
> > > > > 
> > > > > > > Not for the patch, but it looks like having a OUT_OF_ORDER feature bit is
> > > > > > > much more simpler to be started with.
> > > > > > +1
> > > > > > 
> > > > > > Best regards,
> > > > > > Tiwei Bie
> 
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC V4 PATCH 0/8] Packed ring layout for vhost
From: Wei Xu @ 2018-05-20 16:25 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, mst, netdev, linux-kernel, virtualization
In-Reply-To: <1526473941-16199-1-git-send-email-jasowang@redhat.com>

On Wed, May 16, 2018 at 08:32:13PM +0800, Jason Wang wrote:
> Hi all:
> 
> This RFC implement packed ring layout. The code were tested with
> Tiwei's RFC V3 ahttps://lkml.org/lkml/2018/4/25/34. Some fixups and
> tweaks were needed on top of Tiwei's code to make it run for event
> index.

Could you please show the change based on Tiwei's code to easy other's
test?

> 
> Pktgen reports about 20% improvement on PPS (event index is off). More
> testing is ongoing.
> 
> Notes for tester:
> 
> - Start from this version, vhost need qemu co-operation to work
>   correctly. Or you can comment out the packed specific code for
>   GET/SET_VRING_BASE.

Do you mean the code in vhost_virtqueue_start/stop? Both Tiwei's and your v3
work fortunately correctly which should be avoided since the ring should be
definitely different.

Wei

> 
> Changes from V3:
> - Fix math on event idx checking
> - Sync last avail wrap counter through GET/SET_VRING_BASE
> - remove desc_event prefix in the driver/device structure
> 
> Changes from V2:
> - do not use & in checking desc_event_flags
> - off should be most significant bit
> - remove the workaround of mergeable buffer for dpdk prototype
> - id should be in the last descriptor in the chain
> - keep _F_WRITE for write descriptor when adding used
> - device flags updating should use ADDR_USED type
> - return error on unexpected unavail descriptor in a chain
> - return false in vhost_ve_avail_empty is descriptor is available
> - track last seen avail_wrap_counter
> - correctly examine available descriptor in get_indirect_packed()
> - vhost_idx_diff should return u16 instead of bool
> 
> Changes from V1:
> 
> - Refactor vhost used elem code to avoid open coding on used elem
> - Event suppression support (compile test only).
> - Indirect descriptor support (compile test only).
> - Zerocopy support.
> - vIOMMU support.
> - SCSI/VSOCK support (compile test only).
> - Fix several bugs
> 
> Jason Wang (8):
>   vhost: move get_rx_bufs to vhost.c
>   vhost: hide used ring layout from device
>   vhost: do not use vring_used_elem
>   vhost_net: do not explicitly manipulate vhost_used_elem
>   vhost: vhost_put_user() can accept metadata type
>   virtio: introduce packed ring defines
>   vhost: packed ring support
>   vhost: event suppression for packed ring
> 
>  drivers/vhost/net.c                | 136 ++----
>  drivers/vhost/scsi.c               |  62 +--
>  drivers/vhost/vhost.c              | 861 ++++++++++++++++++++++++++++++++-----
>  drivers/vhost/vhost.h              |  47 +-
>  drivers/vhost/vsock.c              |  42 +-
>  include/uapi/linux/virtio_config.h |   9 +
>  include/uapi/linux/virtio_ring.h   |  32 ++
>  7 files changed, 928 insertions(+), 261 deletions(-)
> 
> -- 
> 2.7.4
> 

^ permalink raw reply

* Re: [RFC v4 3/5] virtio_ring: add packed ring support
From: Jason Wang @ 2018-05-21  2:30 UTC (permalink / raw)
  To: Tiwei Bie; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180519022938.GA18888@debian>



On 2018年05月19日 10:29, Tiwei Bie wrote:
>> I don't hope so.
>>
>>> I agreed driver should track the DMA addrs or some
>>> other necessary things from the very beginning. And
>>> I also repeated the spec to emphasize that it does
>>> make sense. And I'd like to do that.
>>>
>>> What I was saying is that, to support OOO, we may
>>> need to manage these context (which saves DMA addrs
>>> etc) via a list which is similar to the desc list
>>> maintained via `next` in split ring instead of an
>>> array whose elements always can be indexed directly.
>> My point is these context is a must (not only for OOO).
> Yeah, and I have the exactly same point after you
> pointed that I shouldn't get the addrs from descs.
> I do think it makes sense. I'll do it in the next
> version. I don't have any doubt about it. All my
> questions are about the OOO, instead of whether we
> should save context or not. It just seems that you
> thought I don't want to do it, and were trying to
> convince me that I should do it.

Right, but looks like I was wrong :)

>
>>> The desc ring in split ring is an array, but its
>>> free entries are managed as list via next. I was
>>> just wondering, do we want to manage such a list
>>> because of OOO. It's just a very simple question
>>> that I want to hear your opinion... (It doesn't
>>> means anything, e.g. It doesn't mean I don't want
>>> to support OOO. It's just a simple question...)
>> So the question is yes. But I admit I don't have better idea other than what
>> you propose here (something like split ring which is a little bit sad).
>> Maybe Michael had.
> Yeah, that's why I asked this question. It will
> make the packed ring a bit similar to split ring
> at least in the driver part. So I want to draw
> your attention on this to make sure that we're
> on the same page.

Yes. I think we are.

Thanks

> Best regards,
> Tiwei Bie
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC V4 PATCH 0/8] Packed ring layout for vhost
From: Jason Wang @ 2018-05-21  2:33 UTC (permalink / raw)
  To: Wei Xu; +Cc: kvm, mst, netdev, linux-kernel, virtualization
In-Reply-To: <20180520162514.GA21958@wei-ubt>



On 2018年05月21日 00:25, Wei Xu wrote:
> On Wed, May 16, 2018 at 08:32:13PM +0800, Jason Wang wrote:
>> Hi all:
>>
>> This RFC implement packed ring layout. The code were tested with
>> Tiwei's RFC V3 ahttps://lkml.org/lkml/2018/4/25/34. Some fixups and
>> tweaks were needed on top of Tiwei's code to make it run for event
>> index.
> Could you please show the change based on Tiwei's code to easy other's
> test?

Please try Tiwei's V4 instead of just waiting for the fixup. It should 
work unless you don't try zerocopy and vIOMMU.


>> Pktgen reports about 20% improvement on PPS (event index is off). More
>> testing is ongoing.
>>
>> Notes for tester:
>>
>> - Start from this version, vhost need qemu co-operation to work
>>    correctly. Or you can comment out the packed specific code for
>>    GET/SET_VRING_BASE.
> Do you mean the code in vhost_virtqueue_start/stop?

For qemu, probably.

> Both Tiwei's and your v3
> work fortunately correctly which should be avoided since the ring should be
> definitely different.

I don't understand this, you mean reset work?

Thanks

>
> Wei
>
>> Changes from V3:
>> - Fix math on event idx checking
>> - Sync last avail wrap counter through GET/SET_VRING_BASE
>> - remove desc_event prefix in the driver/device structure
>>
>> Changes from V2:
>> - do not use & in checking desc_event_flags
>> - off should be most significant bit
>> - remove the workaround of mergeable buffer for dpdk prototype
>> - id should be in the last descriptor in the chain
>> - keep _F_WRITE for write descriptor when adding used
>> - device flags updating should use ADDR_USED type
>> - return error on unexpected unavail descriptor in a chain
>> - return false in vhost_ve_avail_empty is descriptor is available
>> - track last seen avail_wrap_counter
>> - correctly examine available descriptor in get_indirect_packed()
>> - vhost_idx_diff should return u16 instead of bool
>>
>> Changes from V1:
>>
>> - Refactor vhost used elem code to avoid open coding on used elem
>> - Event suppression support (compile test only).
>> - Indirect descriptor support (compile test only).
>> - Zerocopy support.
>> - vIOMMU support.
>> - SCSI/VSOCK support (compile test only).
>> - Fix several bugs
>>
>> Jason Wang (8):
>>    vhost: move get_rx_bufs to vhost.c
>>    vhost: hide used ring layout from device
>>    vhost: do not use vring_used_elem
>>    vhost_net: do not explicitly manipulate vhost_used_elem
>>    vhost: vhost_put_user() can accept metadata type
>>    virtio: introduce packed ring defines
>>    vhost: packed ring support
>>    vhost: event suppression for packed ring
>>
>>   drivers/vhost/net.c                | 136 ++----
>>   drivers/vhost/scsi.c               |  62 +--
>>   drivers/vhost/vhost.c              | 861 ++++++++++++++++++++++++++++++++-----
>>   drivers/vhost/vhost.h              |  47 +-
>>   drivers/vhost/vsock.c              |  42 +-
>>   include/uapi/linux/virtio_config.h |   9 +
>>   include/uapi/linux/virtio_ring.h   |  32 ++
>>   7 files changed, 928 insertions(+), 261 deletions(-)
>>
>> -- 
>> 2.7.4
>>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: KASAN: use-after-free Read in vhost_chr_write_iter
From: Jason Wang @ 2018-05-21  2:38 UTC (permalink / raw)
  To: DaeRyong Jeong, mst
  Cc: bammanag, kt0755, kvm, netdev, linux-kernel, virtualization,
	byoungyoung
In-Reply-To: <58419d62-3074-2e5a-8504-da1cdeb08280@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 1988 bytes --]



On 2018年05月18日 17:24, Jason Wang wrote:
>
>
> On 2018年05月17日 21:45, DaeRyong Jeong wrote:
>> We report the crash: KASAN: use-after-free Read in vhost_chr_write_iter
>>
>> This crash has been found in v4.17-rc1 using RaceFuzzer (a modified
>> version of Syzkaller), which we describe more at the end of this
>> report. Our analysis shows that the race occurs when invoking two
>> syscalls concurrently, write$vnet and ioctl$VHOST_RESET_OWNER.
>>
>>
>> Analysis:
>> We think the concurrent execution of vhost_process_iotlb_msg() and
>> vhost_dev_cleanup() causes the crash.
>> Both of functions can run concurrently (please see call sequence below),
>> and possibly, there is a race on dev->iotlb.
>> If the switch occurs right after vhost_dev_cleanup() frees
>> dev->iotlb, vhost_process_iotlb_msg() still sees the non-null value 
>> and it
>> keep executing without returning -EFAULT. Consequently, use-after-free
>> occures
>>
>>
>> Thread interleaving:
>> CPU0 (vhost_process_iotlb_msg)                CPU1 (vhost_dev_cleanup)
>> (In the case of both VHOST_IOTLB_UPDATE and
>> VHOST_IOTLB_INVALIDATE)
>> =====                            =====
>>                             vhost_umem_clean(dev->iotlb);
>> if (!dev->iotlb) {
>>             ret = -EFAULT;
>>                 break;
>> }
>>                             dev->iotlb = NULL;
>>
>>
>> Call Sequence:
>> CPU0
>> =====
>> vhost_net_chr_write_iter
>>     vhost_chr_write_iter
>>         vhost_process_iotlb_msg
>>
>> CPU1
>> =====
>> vhost_net_ioctl
>>     vhost_net_reset_owner
>>         vhost_dev_reset_owner
>>             vhost_dev_cleanup
>
> Thanks a lot for the analysis.
>
> This could be addressed by simply protect it with dev mutex.
>
> Will post a patch.
>

Could you please help to test the attached patch? I've done some smoking 
test.

Thanks

[-- Attachment #2: 0001-vhost-synchronize-IOTLB-message-with-dev-cleanup.patch --]
[-- Type: text/x-patch, Size: 1507 bytes --]

From 88328386f3f652e684ee33dc4cf63dcaed871aea Mon Sep 17 00:00:00 2001
From: Jason Wang <jasowang@redhat.com>
Date: Fri, 18 May 2018 17:33:27 +0800
Subject: [PATCH] vhost: synchronize IOTLB message with dev cleanup

DaeRyong Jeong reports a race between vhost_dev_cleanup() and
vhost_process_iotlb_msg():

Thread interleaving:
CPU0 (vhost_process_iotlb_msg)			CPU1 (vhost_dev_cleanup)
(In the case of both VHOST_IOTLB_UPDATE and
VHOST_IOTLB_INVALIDATE)
=====						=====
						vhost_umem_clean(dev->iotlb);
if (!dev->iotlb) {
	        ret = -EFAULT;
		        break;
}
						dev->iotlb = NULL;

The reason is we don't synchronize between them, fixing by protecting
vhost_process_iotlb_msg() with dev mutex.

Reported-by: DaeRyong Jeong <threeearcat@gmail.com>
Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
Reported-by: DaeRyong Jeong <threeearcat@gmail.com>
---
 drivers/vhost/vhost.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f3bd8e9..f0be5f3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -981,6 +981,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 {
 	int ret = 0;
 
+	mutex_lock(&dev->mutex);
 	vhost_dev_lock_vqs(dev);
 	switch (msg->type) {
 	case VHOST_IOTLB_UPDATE:
@@ -1016,6 +1017,8 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 	}
 
 	vhost_dev_unlock_vqs(dev);
+	mutex_unlock(&dev->mutex);
+
 	return ret;
 }
 ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
-- 
2.7.4


[-- Attachment #3: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related

* Re: [RFC v4 3/5] virtio_ring: add packed ring support
From: Tiwei Bie @ 2018-05-21  2:39 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <b8c0d147-2711-cb26-a084-e7d3d32d0002@redhat.com>

On Mon, May 21, 2018 at 10:30:51AM +0800, Jason Wang wrote:
> On 2018年05月19日 10:29, Tiwei Bie wrote:
> > > I don't hope so.
> > > 
> > > > I agreed driver should track the DMA addrs or some
> > > > other necessary things from the very beginning. And
> > > > I also repeated the spec to emphasize that it does
> > > > make sense. And I'd like to do that.
> > > > 
> > > > What I was saying is that, to support OOO, we may
> > > > need to manage these context (which saves DMA addrs
> > > > etc) via a list which is similar to the desc list
> > > > maintained via `next` in split ring instead of an
> > > > array whose elements always can be indexed directly.
> > > My point is these context is a must (not only for OOO).
> > Yeah, and I have the exactly same point after you
> > pointed that I shouldn't get the addrs from descs.
> > I do think it makes sense. I'll do it in the next
> > version. I don't have any doubt about it. All my
> > questions are about the OOO, instead of whether we
> > should save context or not. It just seems that you
> > thought I don't want to do it, and were trying to
> > convince me that I should do it.
> 
> Right, but looks like I was wrong :)
> 
> > 
> > > > The desc ring in split ring is an array, but its
> > > > free entries are managed as list via next. I was
> > > > just wondering, do we want to manage such a list
> > > > because of OOO. It's just a very simple question
> > > > that I want to hear your opinion... (It doesn't
> > > > means anything, e.g. It doesn't mean I don't want
> > > > to support OOO. It's just a simple question...)
> > > So the question is yes. But I admit I don't have better idea other than what
> > > you propose here (something like split ring which is a little bit sad).
> > > Maybe Michael had.
> > Yeah, that's why I asked this question. It will
> > make the packed ring a bit similar to split ring
> > at least in the driver part. So I want to draw
> > your attention on this to make sure that we're
> > on the same page.
> 
> Yes. I think we are.

Cool. Glad to hear that! Thanks! :)

Best regards,
Tiwei Bie

> 
> Thanks
> 
> > Best regards,
> > Tiwei Bie
> > 
> 
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* [PATCH net 0/4] Fix several issues of virtio-net mergeable XDP
From: Jason Wang @ 2018-05-21  8:35 UTC (permalink / raw)
  To: mst, jasowang; +Cc: netdev, linux-kernel, virtualization

Hi:

Please review the patches that tries to fix sevreal issues of
virtio-net mergeable XDP.

Thanks

Jason Wang (4):
  virtio-net: correctly redirect linearized packet
  virtio-net: correctly transmit XDP buff after linearizing
  virtio-net: reset num_buf to 1 after linearizing packet
  virito-net: fix leaking page for gso packet during mergeable XDP

 drivers/net/virtio_net.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [PATCH net 1/4] virtio-net: correctly redirect linearized packet
From: Jason Wang @ 2018-05-21  8:35 UTC (permalink / raw)
  To: mst, jasowang; +Cc: netdev, linux-kernel, virtualization
In-Reply-To: <1526891706-18516-1-git-send-email-jasowang@redhat.com>

After a linearized packet was redirected by XDP, we should not go for
the err path which will try to pop buffers for the next packet and
increase the drop counter. Fixing this by just drop the page refcnt
for the original page.

Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT")
Reported-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 770422e..c15d240 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -787,7 +787,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			}
 			*xdp_xmit = true;
 			if (unlikely(xdp_page != page))
-				goto err_xdp;
+				put_page(page);
 			rcu_read_unlock();
 			goto xdp_xmit;
 		default:
-- 
2.7.4

^ permalink raw reply related

* [PATCH net 2/4] virtio-net: correctly transmit XDP buff after linearizing
From: Jason Wang @ 2018-05-21  8:35 UTC (permalink / raw)
  To: mst, jasowang; +Cc: netdev, John Fastabend, linux-kernel, virtualization
In-Reply-To: <1526891706-18516-1-git-send-email-jasowang@redhat.com>

We should not go for the error path after successfully transmitting a
XDP buffer after linearizing. Since the error path may try to pop and
drop next packet and increase the drop counters. Fixing this by simply
drop the refcnt of original page and go for xmit path.

Fixes: 72979a6c3590 ("virtio_net: xdp, add slowpath case for non contiguous buffers")
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c15d240..6260d65 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -775,7 +775,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			}
 			*xdp_xmit = true;
 			if (unlikely(xdp_page != page))
-				goto err_xdp;
+				put_page(page);
 			rcu_read_unlock();
 			goto xdp_xmit;
 		case XDP_REDIRECT:
-- 
2.7.4

^ permalink raw reply related

* [PATCH net 3/4] virtio-net: reset num_buf to 1 after linearizing packet
From: Jason Wang @ 2018-05-21  8:35 UTC (permalink / raw)
  To: mst, jasowang; +Cc: netdev, linux-kernel, virtualization
In-Reply-To: <1526891706-18516-1-git-send-email-jasowang@redhat.com>

If we successfully linearize the packets, num_buf were set to zero
which was wrong since we now have only 1 buffer to be used for e.g in
the error path of receive_mergeable(). Zero num_buf will lead the code
try to pop the buffers of next packet and drop it. Fixing this by set
num_buf to 1 if we successfully linearize the packet.

Fixes: 4941d472bf95 ("virtio-net: do not reset during XDP set")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 6260d65..165a922 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -722,6 +722,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 						      &len);
 			if (!xdp_page)
 				goto err_xdp;
+			num_buf = 1;
 			offset = VIRTIO_XDP_HEADROOM;
 		} else {
 			xdp_page = page;
-- 
2.7.4

^ permalink raw reply related

* [PATCH net 4/4] virito-net: fix leaking page for gso packet during mergeable XDP
From: Jason Wang @ 2018-05-21  8:35 UTC (permalink / raw)
  To: mst, jasowang; +Cc: netdev, John Fastabend, linux-kernel, virtualization
In-Reply-To: <1526891706-18516-1-git-send-email-jasowang@redhat.com>

We need to drop refcnt to xdp_page if we see a gso packet. Otherwise
it will be leaked. Fixing this by moving the check of gso packet above
the linearizing logic.

Cc: John Fastabend <john.fastabend@gmail.com>
Fixes: 72979a6c3590 ("virtio_net: xdp, add slowpath case for non contiguous buffers")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 165a922..f8db809 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -707,6 +707,14 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		void *data;
 		u32 act;
 
+		/* Transient failure which in theory could occur if
+		 * in-flight packets from before XDP was enabled reach
+		 * the receive path after XDP is loaded. In practice I
+		 * was not able to create this condition.
+		 */
+		if (unlikely(hdr->hdr.gso_type))
+			goto err_xdp;
+
 		/* This happens when rx buffer size is underestimated
 		 * or headroom is not enough because of the buffer
 		 * was refilled before XDP is set. This should only
@@ -728,14 +736,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			xdp_page = page;
 		}
 
-		/* Transient failure which in theory could occur if
-		 * in-flight packets from before XDP was enabled reach
-		 * the receive path after XDP is loaded. In practice I
-		 * was not able to create this condition.
-		 */
-		if (unlikely(hdr->hdr.gso_type))
-			goto err_xdp;
-
 		/* Allow consuming headroom but reserve enough space to push
 		 * the descriptor on if we get an XDP_TX return code.
 		 */
-- 
2.7.4

^ permalink raw reply related

* [RFC PATCH net-next 00/12] XDP batching for TUN/vhost_net
From: Jason Wang @ 2018-05-21  9:04 UTC (permalink / raw)
  To: mst, jasowang; +Cc: netdev, linux-kernel, kvm, virtualization

Hi all:

We do not support XDP batching for TUN since it can only receive one
packet a time from vhost_net. This series tries to remove this
limitation by:

- introduce a TUN specific msg_control that can hold a pointer to an
  array of XDP buffs
- try copy and build XDP buff in vhost_net
- store XDP buffs in an array and submit them once for every N packets
  from vhost_net
- since TUN can only do native XDP for datacopy packet, to simplify
  the logic, split datacopy out logic and only do batching for
  datacopy.

With this series, TX PPS can improve about 34% from 2.9Mpps to
3.9Mpps when doing xdp_redirect_map between TAP and ixgbe.

Thanks

Jason Wang (12):
  vhost_net: introduce helper to initialize tx iov iter
  vhost_net: introduce vhost_exceeds_weight()
  vhost_net: introduce vhost_has_more_pkts()
  vhost_net: split out datacopy logic
  vhost_net: batch update used ring for datacopy TX
  tuntap: enable premmption early
  tuntap: simplify error handling in tun_build_skb()
  tuntap: tweak on the path of non-xdp case in tun_build_skb()
  tuntap: split out XDP logic
  vhost_net: build xdp buff
  vhost_net: passing raw xdp buff to tun
  vhost_net: batch submitting XDP buffers to underlayer sockets

 drivers/net/tun.c      | 226 +++++++++++++++++++++++++++----------
 drivers/vhost/net.c    | 297 ++++++++++++++++++++++++++++++++++++++++++++-----
 include/linux/if_tun.h |   7 ++
 3 files changed, 444 insertions(+), 86 deletions(-)

-- 
2.7.4

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox