* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
[not found] <1393615016-9187-1-git-send-email-zoltan.kiss@citrix.com>
@ 2014-03-06 17:09 ` Zoltan Kiss
2014-03-07 4:46 ` Pravin Shelar
0 siblings, 1 reply; 8+ messages in thread
From: Zoltan Kiss @ 2014-03-06 17:09 UTC (permalink / raw)
To: Jesse Gross, pshelar, tgraf, dev; +Cc: xen-devel, netdev, linux-kernel, kvm
Do you have any feedback on this? I'm also adding KVM list as they might
be interested in this.
Zoli
On 28/02/14 19:16, Zoltan Kiss wrote:
> The kernel datapath now switched to zerocopy Netlink messages, but that also
> means that the pages on frags array are sent straight to userspace. If those
> pages came outside the kernel, we have to swap them out with local copies.
>
> Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
> ---
> net/openvswitch/datapath.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index 36f8872..ffb563c 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -464,6 +464,12 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
> }
> nla->nla_len = nla_attr_size(skb->len);
>
> + if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) {
> + err = -ENOMEM;
> + skb_tx_error(skb);
> + goto out;
> + }
> +
> skb_zerocopy(user_skb, skb, skb->len, hlen);
>
> /* Pad OVS_PACKET_ATTR_PACKET if linear copy was performed */
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
2014-03-06 17:09 ` [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall Zoltan Kiss
@ 2014-03-07 4:46 ` Pravin Shelar
2014-03-07 12:29 ` Zoltan Kiss
[not found] ` <CALnjE+rWc=n_F+1jSLQtPrgKSvvxONEkkYxWEHon2_KVNG9z3Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 2 replies; 8+ messages in thread
From: Pravin Shelar @ 2014-03-07 4:46 UTC (permalink / raw)
To: Zoltan Kiss
Cc: Jesse Gross, Thomas Graf, dev@openvswitch.org, xen-devel, netdev,
LKML, kvm
On Thu, Mar 6, 2014 at 9:09 AM, Zoltan Kiss <zoltan.kiss@citrix.com> wrote:
> Do you have any feedback on this? I'm also adding KVM list as they might be
> interested in this.
>
> Zoli
>
>
> On 28/02/14 19:16, Zoltan Kiss wrote:
>>
>> The kernel datapath now switched to zerocopy Netlink messages, but that
>> also
>> means that the pages on frags array are sent straight to userspace. If
>> those
>> pages came outside the kernel, we have to swap them out with local copies.
>>
>> Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
I do not think this is required, netlink zero copy only maps
pre-allocated buffers to user-space.
But I found bug in datapath user-space queue code. I am not sure how
this can work with skb fragments and MMAP-netlink socket.
Here is what happens, OVS allocates netlink skb and adds fragments to
skb using skb_zero_copy(), then calls genlmsg_unicast().
But if netlink sock is mmped then netlink-send queues netlink
allocated skb->head (linear data of skb) and ignore skb frags.
Currently this is not problem with OVS vswitchd since it does not use
netlink MMAP sockets. But if vswitchd stats using MMAP-netlink socket,
it can break it.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
2014-03-07 4:46 ` Pravin Shelar
@ 2014-03-07 12:29 ` Zoltan Kiss
[not found] ` <5319BBAE.7030109-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
[not found] ` <CALnjE+rWc=n_F+1jSLQtPrgKSvvxONEkkYxWEHon2_KVNG9z3Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
1 sibling, 1 reply; 8+ messages in thread
From: Zoltan Kiss @ 2014-03-07 12:29 UTC (permalink / raw)
To: Pravin Shelar
Cc: Jesse Gross, Thomas Graf, dev@openvswitch.org, xen-devel, netdev,
LKML, kvm
On 07/03/14 04:46, Pravin Shelar wrote:
> On Thu, Mar 6, 2014 at 9:09 AM, Zoltan Kiss <zoltan.kiss@citrix.com> wrote:
>> Do you have any feedback on this? I'm also adding KVM list as they might be
>> interested in this.
>>
>> Zoli
>>
>>
>> On 28/02/14 19:16, Zoltan Kiss wrote:
>>>
>>> The kernel datapath now switched to zerocopy Netlink messages, but that
>>> also
>>> means that the pages on frags array are sent straight to userspace. If
>>> those
>>> pages came outside the kernel, we have to swap them out with local copies.
>>>
>>> Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
>
> I do not think this is required, netlink zero copy only maps
> pre-allocated buffers to user-space.
How do you mean "pre-allocated"? By who?
As far as I've seen the skb in this function came straight from the
device (vif in our case), and skb_zerocopy just copy the frags to
user_skb, which is sent to the userspace. Those frags contain pages from
guest, and it's a bad idea to pass them to userspace: e.g if userspace
dies in the meantime, what happens with them? Also, in Xen's case they
are actually not mapped to userspace, so accessing them can lead to garbage.
Zoli
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
[not found] ` <CALnjE+rWc=n_F+1jSLQtPrgKSvvxONEkkYxWEHon2_KVNG9z3Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-03-07 15:58 ` Thomas Graf
[not found] ` <5319EC8E.2010606-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 8+ messages in thread
From: Thomas Graf @ 2014-03-07 15:58 UTC (permalink / raw)
To: Pravin Shelar, Zoltan Kiss
Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
kvm-u79uwXL29TY76Z2rM5mHXA, netdev, LKML,
xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b
On 03/07/2014 05:46 AM, Pravin Shelar wrote:
> But I found bug in datapath user-space queue code. I am not sure how
> this can work with skb fragments and MMAP-netlink socket.
> Here is what happens, OVS allocates netlink skb and adds fragments to
> skb using skb_zero_copy(), then calls genlmsg_unicast().
> But if netlink sock is mmped then netlink-send queues netlink
> allocated skb->head (linear data of skb) and ignore skb frags.
>
> Currently this is not problem with OVS vswitchd since it does not use
> netlink MMAP sockets. But if vswitchd stats using MMAP-netlink socket,
> it can break it.
The secret is out ;-)
I was very surprised too when I noticed that it worked. It's not just
OVS, it's nfqueue as well. The reason is that an netlink mmaped skb is
setup with a giant tailroom in netlink_ring_setup_skb():
skb->end = skb->tail + size;
and skb_zerocopy() will consume whatever tailroom is available first:
/* dont bother with small payloads */
if (len <= skb_tailroom(to)) {
skb_copy_bits(from, 0, skb_put(to, len), len);
return;
}
I was planning to fix this while adding GSO support to the upcall as
that is the moment when this bug would really surface.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
[not found] ` <5319EC8E.2010606-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-03-07 17:19 ` Pravin Shelar
[not found] ` <CALnjE+oDM=ga_C6T_-9i2UNwv=K4g-+y-LJA04nh+=WmoeuNXw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 8+ messages in thread
From: Pravin Shelar @ 2014-03-07 17:19 UTC (permalink / raw)
To: Thomas Graf
Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
kvm-u79uwXL29TY76Z2rM5mHXA, netdev, LKML,
xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b
On Fri, Mar 7, 2014 at 7:58 AM, Thomas Graf <tgraf-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On 03/07/2014 05:46 AM, Pravin Shelar wrote:
>>
>> But I found bug in datapath user-space queue code. I am not sure how
>> this can work with skb fragments and MMAP-netlink socket.
>> Here is what happens, OVS allocates netlink skb and adds fragments to
>> skb using skb_zero_copy(), then calls genlmsg_unicast().
>> But if netlink sock is mmped then netlink-send queues netlink
>> allocated skb->head (linear data of skb) and ignore skb frags.
>>
>> Currently this is not problem with OVS vswitchd since it does not use
>> netlink MMAP sockets. But if vswitchd stats using MMAP-netlink socket,
>> it can break it.
>
>
> The secret is out ;-)
>
> I was very surprised too when I noticed that it worked. It's not just
> OVS, it's nfqueue as well. The reason is that an netlink mmaped skb is
> setup with a giant tailroom in netlink_ring_setup_skb():
>
> skb->end = skb->tail + size;
>
For OVS use-case, the size is linear part of skb. so I think for
mmap-netlink socket it will fail.
> and skb_zerocopy() will consume whatever tailroom is available first:
>
> /* dont bother with small payloads */
> if (len <= skb_tailroom(to)) {
> skb_copy_bits(from, 0, skb_put(to, len), len);
> return;
> }
>
> I was planning to fix this while adding GSO support to the upcall as
> that is the moment when this bug would really surface.
ok.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
[not found] ` <5319BBAE.7030109-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
@ 2014-03-07 17:38 ` Pravin Shelar
0 siblings, 0 replies; 8+ messages in thread
From: Pravin Shelar @ 2014-03-07 17:38 UTC (permalink / raw)
To: Zoltan Kiss
Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
kvm-u79uwXL29TY76Z2rM5mHXA, netdev, LKML,
xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b
On Fri, Mar 7, 2014 at 4:29 AM, Zoltan Kiss <zoltan.kiss-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org> wrote:
> On 07/03/14 04:46, Pravin Shelar wrote:
>>
>> On Thu, Mar 6, 2014 at 9:09 AM, Zoltan Kiss <zoltan.kiss-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
>> wrote:
>>>
>>> Do you have any feedback on this? I'm also adding KVM list as they might
>>> be
>>> interested in this.
>>>
>>> Zoli
>>>
>>>
>>> On 28/02/14 19:16, Zoltan Kiss wrote:
>>>>
>>>>
>>>> The kernel datapath now switched to zerocopy Netlink messages, but that
>>>> also
>>>> means that the pages on frags array are sent straight to userspace. If
>>>> those
>>>> pages came outside the kernel, we have to swap them out with local
>>>> copies.
>>>>
>>>> Signed-off-by: Zoltan Kiss <zoltan.kiss-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
>>
>>
>> I do not think this is required, netlink zero copy only maps
>> pre-allocated buffers to user-space.
>
> How do you mean "pre-allocated"? By who?
>
by netlink, ref. netlink_alloc_skb().
> As far as I've seen the skb in this function came straight from the device
> (vif in our case), and skb_zerocopy just copy the frags to user_skb, which
> is sent to the userspace. Those frags contain pages from guest, and it's a
> bad idea to pass them to userspace: e.g if userspace dies in the meantime,
> what happens with them? Also, in Xen's case they are actually not mapped to
> userspace, so accessing them can lead to garbage.
>
AFAIK Netlink does not handle skb-frags for zero-copy case.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
[not found] ` <CALnjE+oDM=ga_C6T_-9i2UNwv=K4g-+y-LJA04nh+=WmoeuNXw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-03-07 18:05 ` Thomas Graf
[not found] ` <531A0A5B.2000104-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 8+ messages in thread
From: Thomas Graf @ 2014-03-07 18:05 UTC (permalink / raw)
To: Pravin Shelar
Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
kvm-u79uwXL29TY76Z2rM5mHXA, netdev, LKML,
xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b
On 03/07/2014 06:19 PM, Pravin Shelar wrote:
> On Fri, Mar 7, 2014 at 7:58 AM, Thomas Graf <tgraf-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> On 03/07/2014 05:46 AM, Pravin Shelar wrote:
>>>
>>> But I found bug in datapath user-space queue code. I am not sure how
>>> this can work with skb fragments and MMAP-netlink socket.
>>> Here is what happens, OVS allocates netlink skb and adds fragments to
>>> skb using skb_zero_copy(), then calls genlmsg_unicast().
>>> But if netlink sock is mmped then netlink-send queues netlink
>>> allocated skb->head (linear data of skb) and ignore skb frags.
>>>
>>> Currently this is not problem with OVS vswitchd since it does not use
>>> netlink MMAP sockets. But if vswitchd stats using MMAP-netlink socket,
>>> it can break it.
>>
>>
>> The secret is out ;-)
>>
>> I was very surprised too when I noticed that it worked. It's not just
>> OVS, it's nfqueue as well. The reason is that an netlink mmaped skb is
>> setup with a giant tailroom in netlink_ring_setup_skb():
>>
>> skb->end = skb->tail + size;
>>
> For OVS use-case, the size is linear part of skb. so I think for
> mmap-netlink socket it will fail.
Could you rephrase? I'm not sure I understand correctly.
The tailroom size equals to the configured frame payload size of
the ring buffer. So as long as the frame size chosen is large
enough to hold whatever pieces comes out of skb_gso_segment() we are
fine. That said, I agree that we should fix this properly before we
enable mmap on the OVS user space side.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall
[not found] ` <531A0A5B.2000104-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-03-07 18:43 ` Pravin Shelar
0 siblings, 0 replies; 8+ messages in thread
From: Pravin Shelar @ 2014-03-07 18:43 UTC (permalink / raw)
To: Thomas Graf
Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
kvm-u79uwXL29TY76Z2rM5mHXA, netdev, LKML,
xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b
On Fri, Mar 7, 2014 at 10:05 AM, Thomas Graf <tgraf-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On 03/07/2014 06:19 PM, Pravin Shelar wrote:
>>
>> On Fri, Mar 7, 2014 at 7:58 AM, Thomas Graf <tgraf-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>>
>>> On 03/07/2014 05:46 AM, Pravin Shelar wrote:
>>>>
>>>>
>>>> But I found bug in datapath user-space queue code. I am not sure how
>>>> this can work with skb fragments and MMAP-netlink socket.
>>>> Here is what happens, OVS allocates netlink skb and adds fragments to
>>>> skb using skb_zero_copy(), then calls genlmsg_unicast().
>>>> But if netlink sock is mmped then netlink-send queues netlink
>>>> allocated skb->head (linear data of skb) and ignore skb frags.
>>>>
>>>> Currently this is not problem with OVS vswitchd since it does not use
>>>> netlink MMAP sockets. But if vswitchd stats using MMAP-netlink socket,
>>>> it can break it.
>>>
>>>
>>>
>>> The secret is out ;-)
>>>
>>> I was very surprised too when I noticed that it worked. It's not just
>>> OVS, it's nfqueue as well. The reason is that an netlink mmaped skb is
>>> setup with a giant tailroom in netlink_ring_setup_skb():
>>>
>>> skb->end = skb->tail + size;
>>>
>> For OVS use-case, the size is linear part of skb. so I think for
>> mmap-netlink socket it will fail.
>
>
> Could you rephrase? I'm not sure I understand correctly.
>
I meant OVS (queue_userspace_packet) passes skb_zerocopy_headlen() to
genlmsg_new_unicast().
> The tailroom size equals to the configured frame payload size of
> the ring buffer. So as long as the frame size chosen is large
> enough to hold whatever pieces comes out of skb_gso_segment() we are
> fine. That said, I agree that we should fix this properly before we
> enable mmap on the OVS user space side.
I see. So this work assuming that userspace sets
nl_mmap_req.nm_frame_size large enough.
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2014-03-07 18:43 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <1393615016-9187-1-git-send-email-zoltan.kiss@citrix.com>
2014-03-06 17:09 ` [PATCH] openvswitch: Orphan frags before sending to userspace via Netlink to avoid guest stall Zoltan Kiss
2014-03-07 4:46 ` Pravin Shelar
2014-03-07 12:29 ` Zoltan Kiss
[not found] ` <5319BBAE.7030109-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
2014-03-07 17:38 ` Pravin Shelar
[not found] ` <CALnjE+rWc=n_F+1jSLQtPrgKSvvxONEkkYxWEHon2_KVNG9z3Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-03-07 15:58 ` Thomas Graf
[not found] ` <5319EC8E.2010606-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-03-07 17:19 ` Pravin Shelar
[not found] ` <CALnjE+oDM=ga_C6T_-9i2UNwv=K4g-+y-LJA04nh+=WmoeuNXw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-03-07 18:05 ` Thomas Graf
[not found] ` <531A0A5B.2000104-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-03-07 18:43 ` Pravin Shelar
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox