netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
@ 2014-02-28 19:16 Zoltan Kiss
  2014-03-06 17:09 ` Zoltan Kiss
       [not found] ` <1393615016-9187-1-git-send-email-zoltan.kiss-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
  0 siblings, 2 replies; 15+ messages in thread
From: Zoltan Kiss @ 2014-02-28 19:16 UTC (permalink / raw)
  To: Jesse Gross, pshelar-l0M0P4e3n4LQT0dZR+AlfA,
	tgraf-H+wXaHxf7aLQT0dZR+AlfA, dev-yBygre7rU0TnMu66kgdUjQ
  Cc: xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA

The kernel datapath now switched to zerocopy Netlink messages, but that also
means that the pages on frags array are sent straight to userspace. If those
pages came outside the kernel, we have to swap them out with local copies.

Signed-off-by: Zoltan Kiss <zoltan.kiss-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
---
 net/openvswitch/datapath.c |    6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 36f8872..ffb563c 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -464,6 +464,12 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
 	}
 	nla->nla_len = nla_attr_size(skb->len);
 
+	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) {
+		err = -ENOMEM;
+		skb_tx_error(skb);
+		goto out;
+	}
+
 	skb_zerocopy(user_skb, skb, skb->len, hlen);
 
 	/* Pad OVS_PACKET_ATTR_PACKET if linear copy was performed */

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
  2014-02-28 19:16 [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall Zoltan Kiss
@ 2014-03-06 17:09 ` Zoltan Kiss
  2014-03-07  4:46   ` Pravin Shelar
       [not found] ` <1393615016-9187-1-git-send-email-zoltan.kiss-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 15+ messages in thread
From: Zoltan Kiss @ 2014-03-06 17:09 UTC (permalink / raw)
  To: Jesse Gross, pshelar, tgraf, dev; +Cc: xen-devel, netdev, linux-kernel, kvm

Do you have any feedback on this? I'm also adding KVM list as they might 
be interested in this.

Zoli

On 28/02/14 19:16, Zoltan Kiss wrote:
> The kernel datapath now switched to zerocopy Netlink messages, but that also
> means that the pages on frags array are sent straight to userspace. If those
> pages came outside the kernel, we have to swap them out with local copies.
>
> Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
> ---
>   net/openvswitch/datapath.c |    6 ++++++
>   1 file changed, 6 insertions(+)
>
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index 36f8872..ffb563c 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -464,6 +464,12 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
>   	}
>   	nla->nla_len = nla_attr_size(skb->len);
>
> +	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) {
> +		err = -ENOMEM;
> +		skb_tx_error(skb);
> +		goto out;
> +	}
> +
>   	skb_zerocopy(user_skb, skb, skb->len, hlen);
>
>   	/* Pad OVS_PACKET_ATTR_PACKET if linear copy was performed */
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
  2014-03-06 17:09 ` Zoltan Kiss
@ 2014-03-07  4:46   ` Pravin Shelar
  2014-03-07 12:29     ` Zoltan Kiss
       [not found]     ` <CALnjE+rWc=n_F+1jSLQtPrgKSvvxONEkkYxWEHon2_KVNG9z3Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 2 replies; 15+ messages in thread
From: Pravin Shelar @ 2014-03-07  4:46 UTC (permalink / raw)
  To: Zoltan Kiss
  Cc: Jesse Gross, Thomas Graf, dev@openvswitch.org, xen-devel, netdev,
	LKML, kvm

On Thu, Mar 6, 2014 at 9:09 AM, Zoltan Kiss <zoltan.kiss@citrix.com> wrote:
> Do you have any feedback on this? I'm also adding KVM list as they might be
> interested in this.
>
> Zoli
>
>
> On 28/02/14 19:16, Zoltan Kiss wrote:
>>
>> The kernel datapath now switched to zerocopy Netlink messages, but that
>> also
>> means that the pages on frags array are sent straight to userspace. If
>> those
>> pages came outside the kernel, we have to swap them out with local copies.
>>
>> Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>

I do not think this is required, netlink zero copy only maps
pre-allocated buffers to user-space.

But I found bug in datapath user-space queue code. I am not sure how
this can work with skb fragments and MMAP-netlink socket.
Here is what happens, OVS allocates netlink skb and adds fragments to
skb using skb_zero_copy(), then calls genlmsg_unicast().
But if netlink sock is mmped then netlink-send queues netlink
allocated skb->head (linear data of skb) and ignore skb frags.

Currently this is not problem with OVS vswitchd since it does not use
netlink MMAP sockets. But if vswitchd stats using MMAP-netlink socket,
it can break it.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
  2014-03-07  4:46   ` Pravin Shelar
@ 2014-03-07 12:29     ` Zoltan Kiss
       [not found]       ` <5319BBAE.7030109-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
       [not found]     ` <CALnjE+rWc=n_F+1jSLQtPrgKSvvxONEkkYxWEHon2_KVNG9z3Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 15+ messages in thread
From: Zoltan Kiss @ 2014-03-07 12:29 UTC (permalink / raw)
  To: Pravin Shelar
  Cc: Jesse Gross, Thomas Graf, dev@openvswitch.org, xen-devel, netdev,
	LKML, kvm

On 07/03/14 04:46, Pravin Shelar wrote:
> On Thu, Mar 6, 2014 at 9:09 AM, Zoltan Kiss <zoltan.kiss@citrix.com> wrote:
>> Do you have any feedback on this? I'm also adding KVM list as they might be
>> interested in this.
>>
>> Zoli
>>
>>
>> On 28/02/14 19:16, Zoltan Kiss wrote:
>>>
>>> The kernel datapath now switched to zerocopy Netlink messages, but that
>>> also
>>> means that the pages on frags array are sent straight to userspace. If
>>> those
>>> pages came outside the kernel, we have to swap them out with local copies.
>>>
>>> Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
>
> I do not think this is required, netlink zero copy only maps
> pre-allocated buffers to user-space.
How do you mean "pre-allocated"? By who?

As far as I've seen the skb in this function came straight from the 
device (vif in our case), and skb_zerocopy just copy the frags to 
user_skb, which is sent to the userspace. Those frags contain pages from 
guest, and it's a bad idea to pass them to userspace: e.g if userspace 
dies in the meantime, what happens with them? Also, in Xen's case they 
are actually not mapped to userspace, so accessing them can lead to garbage.

Zoli

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
       [not found]     ` <CALnjE+rWc=n_F+1jSLQtPrgKSvvxONEkkYxWEHon2_KVNG9z3Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-03-07 15:58       ` Thomas Graf
       [not found]         ` <5319EC8E.2010606-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Graf @ 2014-03-07 15:58 UTC (permalink / raw)
  To: Pravin Shelar, Zoltan Kiss
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	kvm-u79uwXL29TY76Z2rM5mHXA, netdev, LKML,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b

On 03/07/2014 05:46 AM, Pravin Shelar wrote:
> But I found bug in datapath user-space queue code. I am not sure how
> this can work with skb fragments and MMAP-netlink socket.
> Here is what happens, OVS allocates netlink skb and adds fragments to
> skb using skb_zero_copy(), then calls genlmsg_unicast().
> But if netlink sock is mmped then netlink-send queues netlink
> allocated skb->head (linear data of skb) and ignore skb frags.
>
> Currently this is not problem with OVS vswitchd since it does not use
> netlink MMAP sockets. But if vswitchd stats using MMAP-netlink socket,
> it can break it.

The secret is out ;-)

I was very surprised too when I noticed that it worked. It's not just
OVS, it's nfqueue as well. The reason is that an netlink mmaped skb is
setup with a giant tailroom in netlink_ring_setup_skb():

	skb->end	= skb->tail + size;

and skb_zerocopy() will consume whatever tailroom is available first:

	/* dont bother with small payloads */
	if (len <= skb_tailroom(to)) {
		skb_copy_bits(from, 0, skb_put(to, len), len);
		return;
	}

I was planning to fix this while adding GSO support to the upcall as
that is the moment when this bug would really surface.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
       [not found] ` <1393615016-9187-1-git-send-email-zoltan.kiss-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
@ 2014-03-07 16:23   ` Thomas Graf
       [not found]     ` <5319F272.1070101-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Graf @ 2014-03-07 16:23 UTC (permalink / raw)
  To: Zoltan Kiss, Jesse Gross, pshelar-l0M0P4e3n4LQT0dZR+AlfA,
	dev-yBygre7rU0TnMu66kgdUjQ
  Cc: xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA

On 02/28/2014 08:16 PM, Zoltan Kiss wrote:
> The kernel datapath now switched to zerocopy Netlink messages, but that also
> means that the pages on frags array are sent straight to userspace. If those
> pages came outside the kernel, we have to swap them out with local copies.
>
> Signed-off-by: Zoltan Kiss <zoltan.kiss-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
> ---
>   net/openvswitch/datapath.c |    6 ++++++
>   1 file changed, 6 insertions(+)
>
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index 36f8872..ffb563c 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -464,6 +464,12 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
>   	}
>   	nla->nla_len = nla_attr_size(skb->len);
>
> +	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) {
> +		err = -ENOMEM;
> +		skb_tx_error(skb);
> +		goto out;
> +	}
> +
>   	skb_zerocopy(user_skb, skb, skb->len, hlen);
>
>   	/* Pad OVS_PACKET_ATTR_PACKET if linear copy was performed */

How about we move the call to skb_orphan_frags() into skb_zerocopy()
itself and call it before we actually reference the frags?

It's not just OVS affected but nfqueue as well.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
       [not found]         ` <5319EC8E.2010606-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-03-07 17:19           ` Pravin Shelar
       [not found]             ` <CALnjE+oDM=ga_C6T_-9i2UNwv=K4g-+y-LJA04nh+=WmoeuNXw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Pravin Shelar @ 2014-03-07 17:19 UTC (permalink / raw)
  To: Thomas Graf
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	kvm-u79uwXL29TY76Z2rM5mHXA, netdev, LKML,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b

On Fri, Mar 7, 2014 at 7:58 AM, Thomas Graf <tgraf-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On 03/07/2014 05:46 AM, Pravin Shelar wrote:
>>
>> But I found bug in datapath user-space queue code. I am not sure how
>> this can work with skb fragments and MMAP-netlink socket.
>> Here is what happens, OVS allocates netlink skb and adds fragments to
>> skb using skb_zero_copy(), then calls genlmsg_unicast().
>> But if netlink sock is mmped then netlink-send queues netlink
>> allocated skb->head (linear data of skb) and ignore skb frags.
>>
>> Currently this is not problem with OVS vswitchd since it does not use
>> netlink MMAP sockets. But if vswitchd stats using MMAP-netlink socket,
>> it can break it.
>
>
> The secret is out ;-)
>
> I was very surprised too when I noticed that it worked. It's not just
> OVS, it's nfqueue as well. The reason is that an netlink mmaped skb is
> setup with a giant tailroom in netlink_ring_setup_skb():
>
>         skb->end        = skb->tail + size;
>
For OVS use-case, the size is linear part of skb. so I think for
mmap-netlink socket it will fail.

> and skb_zerocopy() will consume whatever tailroom is available first:
>
>         /* dont bother with small payloads */
>         if (len <= skb_tailroom(to)) {
>                 skb_copy_bits(from, 0, skb_put(to, len), len);
>                 return;
>         }
>
> I was planning to fix this while adding GSO support to the upcall as
> that is the moment when this bug would really surface.

ok.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
       [not found]     ` <5319F272.1070101-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-03-07 17:28       ` Pravin Shelar
       [not found]         ` <CALnjE+q=fejHPsjVj9+jnypJCUHfCDFc553U48WR48Tjcf3FZw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Pravin Shelar @ 2014-03-07 17:28 UTC (permalink / raw)
  To: Thomas Graf
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org, netdev, LKML,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b

On Fri, Mar 7, 2014 at 8:23 AM, Thomas Graf <tgraf-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On 02/28/2014 08:16 PM, Zoltan Kiss wrote:
>>
>> The kernel datapath now switched to zerocopy Netlink messages, but that
>> also
>> means that the pages on frags array are sent straight to userspace. If
>> those
>> pages came outside the kernel, we have to swap them out with local copies.
>>
>> Signed-off-by: Zoltan Kiss <zoltan.kiss-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
>> ---
>>   net/openvswitch/datapath.c |    6 ++++++
>>   1 file changed, 6 insertions(+)
>>
>> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
>> index 36f8872..ffb563c 100644
>> --- a/net/openvswitch/datapath.c
>> +++ b/net/openvswitch/datapath.c
>> @@ -464,6 +464,12 @@ static int queue_userspace_packet(struct datapath
>> *dp, struct sk_buff *skb,
>>         }
>>         nla->nla_len = nla_attr_size(skb->len);
>>
>> +       if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) {
>> +               err = -ENOMEM;
>> +               skb_tx_error(skb);
>> +               goto out;
>> +       }
>> +
>>         skb_zerocopy(user_skb, skb, skb->len, hlen);
>>
>>         /* Pad OVS_PACKET_ATTR_PACKET if linear copy was performed */
>
>
> How about we move the call to skb_orphan_frags() into skb_zerocopy()
> itself and call it before we actually reference the frags?
>

Problem is mapping SKBTX_DEV_ZEROCOPY pages to userspace. skb_zerocopy
is not doing that.

Unless I missing something, Current netlink code can not handle
skb-frags with zero copy. So we have to copy skb anyways and no need
to orphan-frags here.
If you are planning on handling skb-frags without copying then
skb_orphan_frags should be done in netlink.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
       [not found]       ` <5319BBAE.7030109-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
@ 2014-03-07 17:38         ` Pravin Shelar
  0 siblings, 0 replies; 15+ messages in thread
From: Pravin Shelar @ 2014-03-07 17:38 UTC (permalink / raw)
  To: Zoltan Kiss
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	kvm-u79uwXL29TY76Z2rM5mHXA, netdev, LKML,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b

On Fri, Mar 7, 2014 at 4:29 AM, Zoltan Kiss <zoltan.kiss-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org> wrote:
> On 07/03/14 04:46, Pravin Shelar wrote:
>>
>> On Thu, Mar 6, 2014 at 9:09 AM, Zoltan Kiss <zoltan.kiss-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
>> wrote:
>>>
>>> Do you have any feedback on this? I'm also adding KVM list as they might
>>> be
>>> interested in this.
>>>
>>> Zoli
>>>
>>>
>>> On 28/02/14 19:16, Zoltan Kiss wrote:
>>>>
>>>>
>>>> The kernel datapath now switched to zerocopy Netlink messages, but that
>>>> also
>>>> means that the pages on frags array are sent straight to userspace. If
>>>> those
>>>> pages came outside the kernel, we have to swap them out with local
>>>> copies.
>>>>
>>>> Signed-off-by: Zoltan Kiss <zoltan.kiss-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
>>
>>
>> I do not think this is required, netlink zero copy only maps
>> pre-allocated buffers to user-space.
>
> How do you mean "pre-allocated"? By who?
>
by netlink, ref. netlink_alloc_skb().

> As far as I've seen the skb in this function came straight from the device
> (vif in our case), and skb_zerocopy just copy the frags to user_skb, which
> is sent to the userspace. Those frags contain pages from guest, and it's a
> bad idea to pass them to userspace: e.g if userspace dies in the meantime,
> what happens with them? Also, in Xen's case they are actually not mapped to
> userspace, so accessing them can lead to garbage.
>
AFAIK Netlink does not handle skb-frags for zero-copy case.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
       [not found]         ` <CALnjE+q=fejHPsjVj9+jnypJCUHfCDFc553U48WR48Tjcf3FZw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-03-07 17:59           ` Thomas Graf
       [not found]             ` <531A0911.4040304-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Graf @ 2014-03-07 17:59 UTC (permalink / raw)
  To: Pravin Shelar
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org, netdev, LKML,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b

On 03/07/2014 06:28 PM, Pravin Shelar wrote:
> Problem is mapping SKBTX_DEV_ZEROCOPY pages to userspace. skb_zerocopy
> is not doing that.
>
> Unless I missing something, Current netlink code can not handle
> skb-frags with zero copy. So we have to copy skb anyways and no need
> to orphan-frags here.
> If you are planning on handling skb-frags without copying then
> skb_orphan_frags should be done in netlink.

If you look at the second part of skb_zerocopy() this is exactly what
it is doing unless the target skb has sufficient linear space
preallocated. At least unless mmap is enabled in which case we would
have to copy again until we have implemented a way to pass page refs
via the nl ring buffer.

So I think Zoltan is correct in orphaning frags that come from f.e.
a tun device via zerocopy_sg_from_iovec().

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
       [not found]             ` <CALnjE+oDM=ga_C6T_-9i2UNwv=K4g-+y-LJA04nh+=WmoeuNXw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-03-07 18:05               ` Thomas Graf
       [not found]                 ` <531A0A5B.2000104-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Graf @ 2014-03-07 18:05 UTC (permalink / raw)
  To: Pravin Shelar
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	kvm-u79uwXL29TY76Z2rM5mHXA, netdev, LKML,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b

On 03/07/2014 06:19 PM, Pravin Shelar wrote:
> On Fri, Mar 7, 2014 at 7:58 AM, Thomas Graf <tgraf-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> On 03/07/2014 05:46 AM, Pravin Shelar wrote:
>>>
>>> But I found bug in datapath user-space queue code. I am not sure how
>>> this can work with skb fragments and MMAP-netlink socket.
>>> Here is what happens, OVS allocates netlink skb and adds fragments to
>>> skb using skb_zero_copy(), then calls genlmsg_unicast().
>>> But if netlink sock is mmped then netlink-send queues netlink
>>> allocated skb->head (linear data of skb) and ignore skb frags.
>>>
>>> Currently this is not problem with OVS vswitchd since it does not use
>>> netlink MMAP sockets. But if vswitchd stats using MMAP-netlink socket,
>>> it can break it.
>>
>>
>> The secret is out ;-)
>>
>> I was very surprised too when I noticed that it worked. It's not just
>> OVS, it's nfqueue as well. The reason is that an netlink mmaped skb is
>> setup with a giant tailroom in netlink_ring_setup_skb():
>>
>>          skb->end        = skb->tail + size;
>>
> For OVS use-case, the size is linear part of skb. so I think for
> mmap-netlink socket it will fail.

Could you rephrase? I'm not sure I understand correctly.

The tailroom size equals to the configured frame payload size of
the ring buffer. So as long as the frame size chosen is large
enough to hold whatever pieces comes out of skb_gso_segment() we are
fine. That said, I agree that we should fix this properly before we
enable mmap on the OVS user space side.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
       [not found]             ` <531A0911.4040304-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-03-07 18:41               ` Pravin Shelar
  2014-03-11 19:41               ` Zoltan Kiss
  1 sibling, 0 replies; 15+ messages in thread
From: Pravin Shelar @ 2014-03-07 18:41 UTC (permalink / raw)
  To: Thomas Graf
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org, netdev, LKML,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b

On Fri, Mar 7, 2014 at 9:59 AM, Thomas Graf <tgraf-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On 03/07/2014 06:28 PM, Pravin Shelar wrote:
>>
>> Problem is mapping SKBTX_DEV_ZEROCOPY pages to userspace. skb_zerocopy
>> is not doing that.
>>
>> Unless I missing something, Current netlink code can not handle
>> skb-frags with zero copy. So we have to copy skb anyways and no need
>> to orphan-frags here.
>> If you are planning on handling skb-frags without copying then
>> skb_orphan_frags should be done in netlink.
>
>
> If you look at the second part of skb_zerocopy() this is exactly what
> it is doing unless the target skb has sufficient linear space
> preallocated. At least unless mmap is enabled in which case we would
> have to copy again until we have implemented a way to pass page refs
> via the nl ring buffer.
>

What is problem with passing page ref with get_page() to new skb as
long as it is accessed kernel?

I think we should set SKBTX_DEV_ZEROCOPY flag for new skb in
skb_zerocopy() to track such pages but no need call
skb_orphan_frags().

> So I think Zoltan is correct in orphaning frags that come from f.e.
> a tun device via zerocopy_sg_from_iovec().

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
       [not found]                 ` <531A0A5B.2000104-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-03-07 18:43                   ` Pravin Shelar
  0 siblings, 0 replies; 15+ messages in thread
From: Pravin Shelar @ 2014-03-07 18:43 UTC (permalink / raw)
  To: Thomas Graf
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	kvm-u79uwXL29TY76Z2rM5mHXA, netdev, LKML,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b

On Fri, Mar 7, 2014 at 10:05 AM, Thomas Graf <tgraf-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On 03/07/2014 06:19 PM, Pravin Shelar wrote:
>>
>> On Fri, Mar 7, 2014 at 7:58 AM, Thomas Graf <tgraf-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>>
>>> On 03/07/2014 05:46 AM, Pravin Shelar wrote:
>>>>
>>>>
>>>> But I found bug in datapath user-space queue code. I am not sure how
>>>> this can work with skb fragments and MMAP-netlink socket.
>>>> Here is what happens, OVS allocates netlink skb and adds fragments to
>>>> skb using skb_zero_copy(), then calls genlmsg_unicast().
>>>> But if netlink sock is mmped then netlink-send queues netlink
>>>> allocated skb->head (linear data of skb) and ignore skb frags.
>>>>
>>>> Currently this is not problem with OVS vswitchd since it does not use
>>>> netlink MMAP sockets. But if vswitchd stats using MMAP-netlink socket,
>>>> it can break it.
>>>
>>>
>>>
>>> The secret is out ;-)
>>>
>>> I was very surprised too when I noticed that it worked. It's not just
>>> OVS, it's nfqueue as well. The reason is that an netlink mmaped skb is
>>> setup with a giant tailroom in netlink_ring_setup_skb():
>>>
>>>          skb->end        = skb->tail + size;
>>>
>> For OVS use-case, the size is linear part of skb. so I think for
>> mmap-netlink socket it will fail.
>
>
> Could you rephrase? I'm not sure I understand correctly.
>
I meant OVS (queue_userspace_packet) passes skb_zerocopy_headlen() to
genlmsg_new_unicast().

> The tailroom size equals to the configured frame payload size of
> the ring buffer. So as long as the frame size chosen is large
> enough to hold whatever pieces comes out of skb_gso_segment() we are
> fine. That said, I agree that we should fix this properly before we
> enable mmap on the OVS user space side.

I see. So this work assuming that userspace sets
nl_mmap_req.nm_frame_size large enough.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
       [not found]             ` <531A0911.4040304-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2014-03-07 18:41               ` Pravin Shelar
@ 2014-03-11 19:41               ` Zoltan Kiss
       [not found]                 ` <531F66D0.1050000-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 15+ messages in thread
From: Zoltan Kiss @ 2014-03-11 19:41 UTC (permalink / raw)
  To: Thomas Graf, Pravin Shelar
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b, LKML, netdev

On 07/03/14 17:59, Thomas Graf wrote:
> On 03/07/2014 06:28 PM, Pravin Shelar wrote:
>> Problem is mapping SKBTX_DEV_ZEROCOPY pages to userspace. skb_zerocopy
>> is not doing that.
>>
>> Unless I missing something, Current netlink code can not handle
>> skb-frags with zero copy. So we have to copy skb anyways and no need
>> to orphan-frags here.
>> If you are planning on handling skb-frags without copying then
>> skb_orphan_frags should be done in netlink.
>
> If you look at the second part of skb_zerocopy() this is exactly what
> it is doing unless the target skb has sufficient linear space
> preallocated. At least unless mmap is enabled in which case we would
> have to copy again until we have implemented a way to pass page refs
> via the nl ring buffer.
>
> So I think Zoltan is correct in orphaning frags that come from f.e.
> a tun device via zerocopy_sg_from_iovec().

Now as I'm checking how Netlink works, I might be wrong at some parts :) 
skb_zerocopy correctly add the frags to the user_skb we are sending 
upwards, however when the userspace receive it in netlink_recvmsg(), it 
gets copied to the supplied buffer anyway. Is that correct? In which 
case we don't need to worry that userspace will sit on that page 
indefinitely. However we have to worry about userspace not calling recv 
on that Netlink socket, so in the end we still need skb_orphan_frags, 
just for a different reason :)
We can put skb_orphan_frags into skb_zerocopy, skb_clone also do that.

However with Netlink mmapped IO, we should take a different approach, 
and instead of calling skb_orphan_frags we should make sure user_skb can 
hold any skb we get from the kernel, and copy the frags there. Even if 
we would be able to pass page refs to userspace through the ring buffer 
(AFAIK currently we can't), it would be fragile to just pass kernel 
pages directly to userspace, even if they came without the 
SKBTX_DEV_ZEROCOPY flag. And I think it would be quite rare that we need 
that copy anyway, because the flow setup usually happens with small 
packets without frags.
If we choose the above approach with Netlink mmap, we don't need 
skb_orphan_frags, in fact

Regards,

Zoli

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
       [not found]                 ` <531F66D0.1050000-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
@ 2014-03-14 22:26                   ` Zoltan Kiss
  0 siblings, 0 replies; 15+ messages in thread
From: Zoltan Kiss @ 2014-03-14 22:26 UTC (permalink / raw)
  To: Thomas Graf, Pravin Shelar
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b, LKML, netdev

On 11/03/14 19:41, Zoltan Kiss wrote:
> On 07/03/14 17:59, Thomas Graf wrote:
>> On 03/07/2014 06:28 PM, Pravin Shelar wrote:
>>> Problem is mapping SKBTX_DEV_ZEROCOPY pages to userspace. skb_zerocopy
>>> is not doing that.
>>>
>>> Unless I missing something, Current netlink code can not handle
>>> skb-frags with zero copy. So we have to copy skb anyways and no need
>>> to orphan-frags here.
>>> If you are planning on handling skb-frags without copying then
>>> skb_orphan_frags should be done in netlink.
>>
>> If you look at the second part of skb_zerocopy() this is exactly what
>> it is doing unless the target skb has sufficient linear space
>> preallocated. At least unless mmap is enabled in which case we would
>> have to copy again until we have implemented a way to pass page refs
>> via the nl ring buffer.
>>
>> So I think Zoltan is correct in orphaning frags that come from f.e.
>> a tun device via zerocopy_sg_from_iovec().
>
> Now as I'm checking how Netlink works, I might be wrong at some parts :)
> skb_zerocopy correctly add the frags to the user_skb we are sending
> upwards, however when the userspace receive it in netlink_recvmsg(), it
> gets copied to the supplied buffer anyway. Is that correct? In which
> case we don't need to worry that userspace will sit on that page
> indefinitely. However we have to worry about userspace not calling recv
> on that Netlink socket, so in the end we still need skb_orphan_frags,
> just for a different reason :)
> We can put skb_orphan_frags into skb_zerocopy, skb_clone also do that.
>
> However with Netlink mmapped IO, we should take a different approach,
> and instead of calling skb_orphan_frags we should make sure user_skb can
> hold any skb we get from the kernel, and copy the frags there. Even if
> we would be able to pass page refs to userspace through the ring buffer
> (AFAIK currently we can't), it would be fragile to just pass kernel
> pages directly to userspace, even if they came without the
> SKBTX_DEV_ZEROCOPY flag. And I think it would be quite rare that we need
> that copy anyway, because the flow setup usually happens with small
> packets without frags.
> If we choose the above approach with Netlink mmap, we don't need
> skb_orphan_frags, in fact

I spent some time to think about this mmaped scenaria, and discussed it 
with others: the conclusion is that it shouldn't be a big problem to 
expose local kernel pages through the frags array as I thought before. 
So OVS can get along with passing refs to those pages in the shared 
ring. However skb_orphan_frags would be still necessary in skb_zerocopy, 
for the same reason as now.
Should I post a new patch which does calls orphan_frags in zerocopy? Or 
do you have any other opinion?

Zoli

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2014-03-14 22:26 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-02-28 19:16 [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall Zoltan Kiss
2014-03-06 17:09 ` Zoltan Kiss
2014-03-07  4:46   ` Pravin Shelar
2014-03-07 12:29     ` Zoltan Kiss
     [not found]       ` <5319BBAE.7030109-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
2014-03-07 17:38         ` Pravin Shelar
     [not found]     ` <CALnjE+rWc=n_F+1jSLQtPrgKSvvxONEkkYxWEHon2_KVNG9z3Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-03-07 15:58       ` Thomas Graf
     [not found]         ` <5319EC8E.2010606-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-03-07 17:19           ` Pravin Shelar
     [not found]             ` <CALnjE+oDM=ga_C6T_-9i2UNwv=K4g-+y-LJA04nh+=WmoeuNXw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-03-07 18:05               ` Thomas Graf
     [not found]                 ` <531A0A5B.2000104-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-03-07 18:43                   ` Pravin Shelar
     [not found] ` <1393615016-9187-1-git-send-email-zoltan.kiss-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
2014-03-07 16:23   ` Thomas Graf
     [not found]     ` <5319F272.1070101-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-03-07 17:28       ` Pravin Shelar
     [not found]         ` <CALnjE+q=fejHPsjVj9+jnypJCUHfCDFc553U48WR48Tjcf3FZw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-03-07 17:59           ` Thomas Graf
     [not found]             ` <531A0911.4040304-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-03-07 18:41               ` Pravin Shelar
2014-03-11 19:41               ` Zoltan Kiss
     [not found]                 ` <531F66D0.1050000-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
2014-03-14 22:26                   ` Zoltan Kiss

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).