From: Florian Westphal <fw@strlen.de>
To: David Miller <davem@davemloft.net>
Cc: fw@strlen.de, netdev@vger.kernel.org, hannes@stressinduktion.org,
edumazet@google.com, herbert@gondor.apana.org.au
Subject: Re: [PATCH -next] net: preserve geometry of fragment sizes when forwarding
Date: Tue, 19 May 2015 14:34:15 +0200 [thread overview]
Message-ID: <20150519123415.GC5063@breakpoint.cc> (raw)
In-Reply-To: <20150518.195105.2072041210798369557.davem@davemloft.net>
David Miller <davem@davemloft.net> wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Mon, 18 May 2015 23:33:29 +0200
>
> > Thanks for spending time on this.
>
> Ok, I've heard what you have to say.
>
> Of the fixes you've proposed already, I prefer the device MTU one
> because it doesn't penalize the ip_fragment.c optimizations just
> because a netfilter module is loaded.
True, OTOH it only allows removing the br_nf hacks, it doesn't fix
a few corner cases (see below).
> Which of your already proposed patches do you prefer, and why?
The IPCB one, main change from that set:
http://patchwork.ozlabs.org/patch/460252/
> I'm happy to apply one or the other as an interim step to make
> forward progress. However, I really want to seriously consider
> our long term handling of fragments in netfilter meanwhile.
Here is a summary of the two.
Both allow removal of bridge netfilter mtu kludge that we have in
ip_fragment, but they have different pros and cons.
A. Add mtu argument to ip_fragment.
pro: actual change is small (most is refactoring)
pro: obvious code flow for bridge netfilter, we can
pass device_mtu - reserved_mtu (encap overhead for ppp, vlan, etc)
as 'new' device mtu, effectively moving this hack from ip_fragment to
br_netfilter.
cons: doesn't resolve hypothetical(?) router edge cases:
#1. if we'd receive e.g. fragment size 1000 and fragment size
200, where both fragments have DF bit set, we will currently
send a single 1200 byte DF packet, if outdevice has mtu >= 1200.
This is because ip_finish_output() only fragments if skb->len > mtu.
#2 iff we do refragment (e.g. out mtu is 1000), we re-create fragments
of size 1000 and 200, but DF bit gets lost in the process.
Neither seems to be a big problem, #2 doesn't break end-to-end connectivity,
and #1 doesn't seem to happen in practice; else we should have seen
bug reports by now. However, I do feel uneasy about it and think we should
fix it.
last A series diffstat:
include/linux/netfilter_bridge.h | 7 -------
include/net/ip.h | 2 +-
net/bridge/br_netfilter.c | 34 +++++++++++++++++++++++++++++++---
net/ipv4/ip_output.c | 29 ++++++++++++++++++-----------
4 files changed, 50 insertions(+), 22 deletions(-)
B. store largest fragment size in IPCB and use that.
pro: we already do this to reject refragmented skbs that had DF set if
the largest original fragment is larger than the outdevice mtu, so this
change is not as big as one might think.
The main aspects of this change is that ip_fragment will do
mtu = IPCB(skb)->largest_frag_size;
if (!mtu)
mtu = ip_skb_dst_mtu(skb);
instead of just calling ip_skb_dst_mtu(), and that we will call
ip_fragment on all skbs that have new IPSKB_FRAG_PMTU flag set, i.e.
if ((IPCB(skb)->flags & IPSKB_FRAG_PMTU) || skb->len > mtu)
ip_fragment()
This will prevent us from sending packets that are larger than
the biggest original fragment we originally got.
bridge netfilter can make use of this; since we store largest
size seen we can just remove the mtu encap overhead calculations.
For ipv6, bridge netfilter will have to be fixed
to preserve IP6CB largest size, just like we do for IPCB.
This will be a bridge netfilter specific change only.
last B series had this diffstat:
include/net/inet_frag.h | 2 +-
include/net/ip.h | 1 +
net/ipv4/ip_forward.c | 18 +++++++++++-------
net/ipv4/ip_fragment.c | 31 ++++++++++++++++++++++++++-----
net/ipv4/ip_output.c | 37 ++++++++++++++++++++++++++++---------
net/ipv6/ip6_output.c | 29 ++++++++++++++++++-----------
6 files changed, 85 insertions(+), 33 deletions(-)
My suggestion would be to go with #B. Its a more intrusive change compared to A,
but it allows to get rid of br netfilter hack while also resolving silly and/or
rare corner cases, such as e.g. reassembling two ipv6 fragments and then failing
to undo it on forward.
Neither approach prevents us from making skb_has_frag_list() fast-path
in ip_fragment work again at a later point.
I can rebase and resubmit both proposals if you prefer so you can review
them in more detail.
Thanks.
next prev parent reply other threads:[~2015-05-19 12:34 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-05-07 21:04 [PATCH -next] net: preserve geometry of fragment sizes when forwarding Florian Westphal
2015-05-18 19:39 ` David Miller
2015-05-18 20:06 ` Florian Westphal
2015-05-18 20:28 ` David Miller
2015-05-18 20:40 ` Florian Westphal
2015-05-18 20:55 ` David Miller
2015-05-18 21:33 ` Florian Westphal
2015-05-18 22:50 ` Herbert Xu
2015-05-18 23:02 ` Florian Westphal
2015-05-18 23:20 ` Herbert Xu
2015-05-18 23:51 ` David Miller
2015-05-19 12:34 ` Florian Westphal [this message]
2015-05-19 19:34 ` Jay Vosburgh
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150519123415.GC5063@breakpoint.cc \
--to=fw@strlen.de \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=hannes@stressinduktion.org \
--cc=herbert@gondor.apana.org.au \
--cc=netdev@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).