From: Florian Westphal <fw@strlen.de>
To: Shmulik Ladkani <shmulik.ladkani@ravellosystems.com>
Cc: Florian Westphal <fw@strlen.de>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Hannes Frederic Sowa <hannes@stressinduktion.org>,
shmulik.ladkani@gmail.com, netdev@vger.kernel.org
Subject: Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, do segmentation even for non IPSKB_FORWARDED skbs
Date: Sat, 9 Jul 2016 11:00:20 +0200 [thread overview]
Message-ID: <20160709090020.GB2067@breakpoint.cc> (raw)
In-Reply-To: <20160705170541.3f210675@pixies>
Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> wrote:
[ CC
> On Tue, 5 Jul 2016 15:03:27 +0200, fw@strlen.de wrote:
> > > The expected behavior in such a setup would be segmenting the skb first,
> > > and then fragmenting each segment according to dst mtu, and finally
> > > passing the resulting fragments to ip_finish_output2.
> > >
> > > 'ip_finish_output_gso' already supports this "Slowpath" behavior,
> > > but it is only considered if IPSKB_FORWARDED is set.
> > >
> > > However in the bridged case, IPSKB_FORWARDED is off, and the "Slowpath"
> > > behavior is not considered.
> >
> > I placed this test there under the assumption that L2 bridges have
> > the same MTU on all bridge ports, so we'd only need to consider routing
> > case.
>
> In our setups we have no control of VM mtu (which affects gso_size of
> packets arriving from tap), and no control of vxlan's underlay mtu.
:-(
> > How does work if e.g. 1460-sized udp packet arrives on tap0?
> > Do we fragment (possibly ignoring DF?)
>
> A *non* gso udp packet arriving on tap0 is bridged to vxlan0 (assuming
> vxlan mtu is sufficient), and the original DF of the inner packet is
> preserved.
>
> The skb gets vxlan-udp encapsulated, with outer IP header having DF=0
> (unless tun_flags & TUNNEL_DONT_FRAGMENT), and then, if skb->len > mtu,
> fragmented normally at the ip_finish_output --> ip_fragment code path.
I see.
> So on wire we have 2 frags of the vxlan datagram; they are reassembled
> at recepient ip stack of vxlan termination. Inner packet preserved.
> Not ideal, but works.
>
> The issue is with GSO skbs arriving from tap, which eventually generates
> segments larger then the mtu, which are not transmitted on eth0:
>
> tap0 rx: super packet, gso_size from user's virtio_net_hdr
> ...
> vxlan0 tx: encaps the super packet
> ...
> ip_finish_output
> ip_finish_output_gso
> *NO* skb_gso_validate_mtu() <--- problem here
We don't do this for ipv6 either since we're expected to send PKTOOBIG
error.
> ip_finish_output2: tx the encapsulated super packet on eth0
> ...
> validate_xmit_skb
> netif_needs_gso
> skb_gso_segment: segments inner payload according to
> original gso_size,
> leads to vxlan datagrams larger than mtu
>
> > How does it work for non-ip protocols?
>
> The specific problem is with vxlan (or any other udp based tunnel)
> encapsulated GSOed packets.
If I understand correctly you have vxlan stacked on top of eth0, and tap
and vxlan in a bridge.
... and eth mtu smaller than bridge mtu.
I think that this is "working" by pure accident, and that better fix is
to set mtu values correctly so that when vxlan header is added we don't
exceed what can be handled by the real device (yes, I know you have
no control over this).
I am worried about this patch, skb_gso_validate_mtu is more costly than
the ->flags & FORWARD check; everyone pays this extra cost.
What about setting IPCB FORWARD flag in iptunnel_xmit if
skb->skb_iif != 0... instead?
Yet another possibility would be to reduce gso_size but that violates
gro/gso symmetry...
[ I tried to check rfc but seems rfc7348 simply declares that
endpoints are not allowed to fragment so problem solved :-/ ]
next prev parent reply other threads:[~2016-07-09 9:00 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-07-05 12:35 [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, do segmentation even for non IPSKB_FORWARDED skbs Shmulik Ladkani
2016-07-05 13:03 ` Florian Westphal
2016-07-05 14:05 ` Shmulik Ladkani
2016-07-09 3:12 ` David Miller
2016-07-09 9:06 ` Florian Westphal
2016-07-09 9:00 ` Florian Westphal [this message]
2016-07-09 12:30 ` Shmulik Ladkani
2016-07-09 13:22 ` Florian Westphal
2016-07-10 7:51 ` Shmulik Ladkani
2016-07-11 8:15 ` Florian Westphal
2016-07-11 13:32 ` Hannes Frederic Sowa
2016-07-12 5:56 ` Shmulik Ladkani
2016-07-13 14:00 ` Shmulik Ladkani
2016-07-14 13:12 ` Hannes Frederic Sowa
2016-07-14 14:13 ` Shmulik Ladkani
2016-07-14 23:32 ` Hannes Frederic Sowa
2016-07-10 20:14 ` Shmulik Ladkani
2016-07-11 8:13 ` Florian Westphal
2016-07-09 15:10 ` Hannes Frederic Sowa
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20160709090020.GB2067@breakpoint.cc \
--to=fw@strlen.de \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=hannes@stressinduktion.org \
--cc=netdev@vger.kernel.org \
--cc=shmulik.ladkani@gmail.com \
--cc=shmulik.ladkani@ravellosystems.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).