From mboxrd@z Thu Jan 1 00:00:00 1970 From: Florian Westphal Subject: Re: [PATCH -next 3/3] ipv4: don't remove df bit when refragmenting Date: Sun, 12 Apr 2015 10:51:22 +0200 Message-ID: <20150412085122.GA14720@breakpoint.cc> References: <1428704189-31247-1-git-send-email-fw@strlen.de> <1428704189-31247-4-git-send-email-fw@strlen.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: netdev@vger.kernel.org, kaber@trash.net, jesse@nicira.com To: Florian Westphal Return-path: Received: from Chamillionaire.breakpoint.cc ([80.244.247.6]:60361 "EHLO Chamillionaire.breakpoint.cc" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750868AbbDLIvZ (ORCPT ); Sun, 12 Apr 2015 04:51:25 -0400 Content-Disposition: inline In-Reply-To: <1428704189-31247-4-git-send-email-fw@strlen.de> Sender: netdev-owner@vger.kernel.org List-ID: Florian Westphal wrote: > We always send fragments without DF bit set. > > Thus, given following setup: > > mtu1500 - mtu1500:1400 - mtu1400:1280 - mtu1280 > A R1 R2 B > > Where R1 and R2 run linux with netfilter defragmentation/conntrack > enabled, then if Host A sent a fragmented packet _with_ DF set to B, R1 > will respond with icmp too big error if one of these fragments exceeded > 1400 bytes. So far, so good. > > However, the host A will never learn about the lower 1280 link. > The next packet presumably sent by A would use 1400 as the new fragment > size, but R1 will strip DF bit when refragmenting. > > Whats worse: If R1 receives fragment sizes 1200 and 100, it would > forward the reassembled packet without refragmenting, i.e. > R2 would send an icmp error in response to a packet that was never sent, > citing mtu that the original sender never exceeded. > > In order to 'replay' the original fragments to preserve their semantics, > one solution is to > > 1. set DF bit on the new fragments if it was set on original ones. > 2. set the size of the new fragments generated by R1 during > refragmentation to the largest size seen when defragmenting. > > R2 will then notice the problem and will send the expected > 'too big, use 1280' icmp error, and further fragments of this size > would not grow anymore to 1400 link mtu when R1 refragments. > > There is however, one important caveat. We cannot just use existing > IPCB(skb)->frag_max_size as upper boundary for refragmentation. > > We have to consider a case where we receive a large fragment without DF, > followed by a small fragment with DF set. > > In such scenario we must not generate a large spew of small DF-fragments > (else we induce packet/traffic amplification). > > This modifies ip_fragment so that we track largest fragment size seen > both for DF and non-DF packets. > > Then, when we find that we had at least one DF fragment AND the largest > non-DF fragment did not exceed one with DF set, let ip_fragment know that > it should refragment using original frag size and also set DF bit on the > newly created fragments. Seems Jesse would prefer if we'd set max_frag_size unconditionally. The advantage of doing that would be that we can easily get rid of the nf_bridge_mtu_reduction() kludge we have right now since we'd just pick the largest original fragment size. Only caveat is that we'd have to check IPSKB_FRAG_PMTU flag too before deciding to reject if out mtu is too small, which is easy to do. So, before I respin patch #3: What do others think?