[RFC]: ip_conntrack breaks UDP PMTU

* [RFC]: ip_conntrack breaks UDP PMTU
@ 2003-02-14  8:06 Harald Welte
  2003-02-14 13:42 ` Patrick McHardy
  2003-02-15 17:58 ` Thomas Poehnitzsch
  0 siblings, 2 replies; 10+ messages in thread
From: Harald Welte @ 2003-02-14  8:06 UTC
  To: Netfilter Development Mailinglist; +Cc: coreteam, kaber

From https://bugzilla.netfilter.org/cgi-bin/bugzilla/show_bug.cgi?id=48

> ip_conntrack defrags packets at PRE_ROUTING and LOCAL_OUT and
> refragments them at POST_ROUTING without caring about IP_DF. Packets
> with IP_DF|IP_MF can be refragmented with a different size, so path
> MTU discovery is broken.  Linux NFS itself sends out packets with
> IP_DF|IP_MF.
>
> ------- Additional Comments From Harald Welte 2003-02-14 09:02 -------
>
> This is a really hard issue. 
>
> The problem is that we _need_ to defragment at NF_IP_PRE_ROUTING in
> order to be able to do connection tracking.  So at this point
> we would need to save the sizes of all individual fragments.  This
> would enable us to re-fragment to exactly the same size at
> POST_ROUTING.
>
> Another obvious approach would be to check for IP_DF and see if the
> packet is bigger than the MTU of the outgoing interface.  The problem
> is: before we do conntrack at NF_IP_PRE_ROUTING we don't know what
> potential NAT bindings apply to this connection/packet - and thus don't
> know the outgoing interface [that's why it's called PRE_ROUTING].
>
> And then, what happens if NAT has to resize (enlarge/shrink) a packet?
> How should we deal with this while re-fragmenting?
>
> I think this needs some good discussion at netfilter-devel...

So what are we going to do?  Does anybody have an alternative (viable?)
approach?  

And if we go for my first proposal, how/where would we store the
list-of-fragment-sizes?  We certainly don't want it to be dynamically
allocated... but according to RFC791 there can be 8192 fragments of 8
octets each...

:((
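
To put numbers on that worst case, here is a small worked calculation. It
is only a sketch: the field widths come from RFC 791, but the 2-byte
per-fragment storage cost (BYTES_PER_ENTRY) is an assumption made for
illustration, not something proposed in this thread.

```python
# Worst-case fragment count implied by the RFC 791 header layout.
FRAG_OFFSET_BITS = 13   # width of the Fragment Offset field
OFFSET_UNIT = 8         # offsets are counted in 8-octet blocks

max_offsets = 2 ** FRAG_OFFSET_BITS        # distinct fragment offsets
max_datagram = max_offsets * OFFSET_UNIT   # octets addressable by offsets

# Assumption for illustration: one 16-bit length stored per fragment.
BYTES_PER_ENTRY = 2
worst_case_table = max_offsets * BYTES_PER_ENTRY

print(max_offsets, max_datagram, worst_case_table)
```

So a static per-datagram table sized for the worst case would cost about
16 KiB under that assumption, which is why a fixed-size list looks
unattractive.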

-- 
- Harald Welte <laforge@gnumonks.org>               http://www.gnumonks.org/
============================================================================
"If this were a dictatorship, it'd be a heck of a lot easier, just so long
 as I'm the dictator."  --  George W. Bush Dec 18, 2000


* Re: [RFC]: ip_conntrack breaks UDP PMTU
  2003-02-14  8:06 [RFC]: ip_conntrack breaks UDP PMTU Harald Welte
@ 2003-02-14 13:42 ` Patrick McHardy
  2003-02-14 14:55   ` Harald Welte
  2003-02-15 19:34   ` [netfilter-core] " Jozsef Kadlecsik
  2003-02-15 17:58 ` Thomas Poehnitzsch
  1 sibling, 2 replies; 10+ messages in thread
From: Patrick McHardy @ 2003-02-14 13:42 UTC
  To: Harald Welte; +Cc: Netfilter Development Mailinglist, coreteam

Harald Welte wrote:

>From https://bugzilla.netfilter.org/cgi-bin/bugzilla/show_bug.cgi?id=48
>
>  
>
>>ip_conntrack defrags packets at PRE_ROUTING and LOCAL_OUT and
>>refragments them at POST_ROUTING without careing about IP_DF. packets
>>with IP_DF|IP_MF can be refragmented with a different size, so path
>>mtu discovery is broken.  Linux nfs itself sends out packets with
>>IP_DF|IP_MF.
>>
>>------- Additional Comments From Harald Welte 2003-02-14 09:02 -------
>>
>>This is a really hard issue. 
>>
>>The problem is that we _need_ to defragment at NF_IP_PRE_ROUTING in
>>order to have the be able to do connection tracking.  So at this point
>>we would need to save the sizes of all individual fragments.  This
>>would enable us to re-fragment to exactly the same size at
>>POST_ROUTING. 
>>
>>Another obvious approach was to check for IP_DF and see if it is
>>bigger than the MTU of the outgoing interface.  The problem is: before
>>we do conntrack at NF_IP_PRE_ROUTING we don't know what potential NAT
>>bindings apply to this connection/packet - and thus don't know the
>>outgoing interface [that's why it's called PRE_ROUTING].
>>
>>And then, what happens if NAT has to resize (enlarge/shrink) a packet.
>>How should we deal with this while re-fragmenting? 
>>
>>I think this needs some good discussion at netfilter-devel...
>>    
>>
>
>So what are we going to do?  Does anybody have an alternative (viable?)
>approach?  
>
>And if we go for my first propsal, how/where would we store the
>list-of-fragment-sizes?  We certainly don't want it to be dynamically
>allocated... but according to RFC791 there kan be 8192 fragments of 8
>octets each...
>

Usually all fragments except the last one will have equal size, so the
fragment sizes can be stored as (size, boundary) tuples. I would suggest
making the max. number of different fragment sizes fixed or controllable
via sysctl and setting it to some low default (like 4). This would reduce
the amount of memory per reassembled packet to
4 * (2 bytes + 2 bytes) = 16 bytes.
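
A minimal sketch of that (size, boundary) encoding, assuming that
"boundary" means the payload offset at which a run of equally sized
fragments ends. The function names and the MAX_RUNS constant are
hypothetical and only illustrate the idea; this is not conntrack code.

```python
import struct

MAX_RUNS = 4  # assumed default; the suggestion is to make this a sysctl


def compress_fragment_sizes(sizes, max_runs=MAX_RUNS):
    """Collapse per-fragment payload sizes into (size, boundary) runs.

    `boundary` is the payload offset where the run of equally sized
    fragments ends. Returns None if more than `max_runs` runs are needed.
    """
    runs = []
    offset = 0
    for size in sizes:
        offset += size
        if runs and runs[-1][0] == size:
            runs[-1] = (size, offset)   # extend the current run
        else:
            if len(runs) == max_runs:
                return None             # too many distinct sizes
            runs.append((size, offset))
    return runs


def pack_runs(runs, max_runs=MAX_RUNS):
    """Pack runs into a fixed buffer of max_runs * (2 + 2) bytes."""
    padded = runs + [(0, 0)] * (max_runs - len(runs))
    flat = [value for run in padded for value in run]
    return struct.pack("!%dH" % (2 * max_runs), *flat)


runs = compress_fragment_sizes([1480, 1480, 1480, 560])
print(runs, len(pack_runs(runs)))
```

A typical datagram split for an Ethernet MTU needs only two runs. What to
do when compress_fragment_sizes() returns None is exactly the open
question about exceeding the limit.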

Bye,
Patrick


* Re: [RFC]: ip_conntrack breaks UDP PMTU
  2003-02-14 13:42 ` Patrick McHardy
@ 2003-02-14 14:55   ` Harald Welte
  2003-02-15  5:12     ` Patrick McHardy
  2003-02-15 19:34   ` [netfilter-core] " Jozsef Kadlecsik
  1 sibling, 1 reply; 10+ messages in thread
From: Harald Welte @ 2003-02-14 14:55 UTC
  To: Patrick McHardy; +Cc: Netfilter Development Mailinglist, coreteam

On Fri, Feb 14, 2003 at 02:42:16PM +0100, Patrick McHardy wrote:

> Usually all fragments except the last one will have equal size, so the
> fragment sizes can be stored as (size, boundary) tuples. I would
> suggest making the max.

yes, usually.  But if we implement this 'fragment-backlog', we should do
it as well as possible.

I see people already whining about 'firewall can be detected because
overlapping fragments are not present after passing through' or stuff
like this :((

> number of different fragment sizes fixed or controllable via sysctl and 
> set it to some low default (like 4). This would reduce the amount of
> memory per reassembled packet to 4 * (2b + 2b) = 16b.

so what do we do if the number is exceeded?  fall back to current
behaviour?

Thanks for your feedback.

> Bye,
> Patrick

-- 
- Harald Welte <laforge@gnumonks.org>               http://www.gnumonks.org/
============================================================================
"If this were a dictatorship, it'd be a heck of a lot easier, just so long
 as I'm the dictator."  --  George W. Bush Dec 18, 2000


* Re: [RFC]: ip_conntrack breaks UDP PMTU
  2003-02-14 14:55   ` Harald Welte
@ 2003-02-15  5:12     ` Patrick McHardy
  0 siblings, 0 replies; 10+ messages in thread
From: Patrick McHardy @ 2003-02-15  5:12 UTC
  To: Harald Welte; +Cc: Netfilter Development Mailinglist, coreteam

Harald Welte wrote:

>On Fri, Feb 14, 2003 at 02:42:16PM +0100, Patrick McHardy wrote:
>
>  
>
>>Usually all fragments except the last one will have equal size, so the
>>fragment sizes can be stored as (size, boundary) tuples. I would
>>suggest making the max.
>>    
>>
>
>yes, usually.  But if we implement this 'fragment-backlog', we should do
>it as good as possible.  
>
>I see people already whining about 'firewall can be detected because
>overlapping fragments are not present after passing through' or stuff
>like this :((
>

This is probably unavoidable as long as we want to use ip_defrag. I think we
really don't want the "perfect solution".

>
>  
>
>>number of different fragment sizes fixed or controllable via sysctl and 
>>set it to some low default (like 4). This would reduce the amount of
>>memory per reassembled packet to 4 * (2b + 2b) = 16b.
>>    
>>
>
>so what do we do if the number is exceeded?  fallback to current
>behaviour?
>

I have to admit, I really don't know. I would favour dropping such
crap with a higher default of maybe 8-16, although I understand
conntrack should not drop any packets.

I guess some cruel decisions have to be made here, and we haven't even
started to think about mangling nat helpers ..

Bye,
Patrick


* Re: [RFC]: ip_conntrack breaks UDP PMTU
  2003-02-14  8:06 [RFC]: ip_conntrack breaks UDP PMTU Harald Welte
  2003-02-14 13:42 ` Patrick McHardy
@ 2003-02-15 17:58 ` Thomas Poehnitzsch
  2003-02-15 20:50   ` Patrick McHardy
  2003-02-16 19:54   ` Harald Welte
  1 sibling, 2 replies; 10+ messages in thread
From: Thomas Poehnitzsch @ 2003-02-15 17:58 UTC
  To: netfilter developer mailinglist

Hi,

I hope this discussion is not already over. Sorry, but it took me a
while to understand all the implications and to skim through some RFCs.

On Fri, Feb 14, 2003 at 09:06:12AM +0100, Harald Welte wrote:
> From https://bugzilla.netfilter.org/cgi-bin/bugzilla/show_bug.cgi?id=48

>> ip_conntrack defrags packets at PRE_ROUTING and LOCAL_OUT and
>> refragments them at POST_ROUTING without careing about IP_DF. packets

What has IP_DF (I hope you mean the "Don't Fragment" bit in the
IP-header) to do with _de_-fragmentation? As far as I understood RFC791
("Any internet datagram so marked is not to be internet fragmented under
any circumstances.") this should never happen, and if so another host
down the route has already f***ed up the packet. 
I haven't tested whether iptables really ignores IP_DF in POST_ROUTING,
but if so, this is a serious bug and should be fixed.

>> with IP_DF|IP_MF can be refragmented with a different size, so path
>> mtu discovery is broken.  Linux nfs itself sends out packets with
>> IP_DF|IP_MF.

Could somebody please explain the notation: "IP_DF|IP_MF" to me? Does
this mean at least one of both flags is set? And if so this is against
my understanding of the above mentioned RFC791. If IP_DF is set, the
packet _must not_ be fragmented, so IP_MF can't be set.

Let me go through an example of PMTUD and correct me if I am wrong with
my view of this protocol:

Assume host A wants to send a packet to host B. If PMTUD is enabled
(/proc/sys/net/ipv4/ip_no_pmtu_disc = 0), the IP_DF flag will be set in
the packet sent. In case this packet will not pass through the eye of a
needle further down the line, it will be dropped and an ICMP message
(type 3 code 4: fragmentation needed and DF set) will be sent to A.
Host A will then resend smaller packets (again with IP_DF set) until the
packet reaches host B.
The way I understand it, there won't be any fragmented packet on the
line in this connection, so iptables will not break anything.
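
The exchange described above, for a sender that sets IP_DF on every
packet, can be sketched as a small simulation. This is a simplified model
under stated assumptions: the function name and the hop list are
hypothetical, and the step of retrying with the MTU reported by the
dropping hop follows RFC 1191 rather than anything stated in this mail.

```python
def discover_path_mtu(link_mtus, first_try):
    """Simulate PMTUD for a sender that sets DF on every packet.

    link_mtus: MTU of each hop on the way from A to B.
    Returns (packet size that reached B, list of sizes A tried).
    """
    size = first_try
    attempts = []
    while True:
        attempts.append(size)
        # First hop whose MTU is too small for this packet, if any.
        bottleneck = next((mtu for mtu in link_mtus if mtu < size), None)
        if bottleneck is None:
            return size, attempts   # packet reached B unfragmented
        # The hop drops the packet and sends ICMP type 3 code 4; we assume
        # it reports its MTU (RFC 1191), so A retries with that size.
        size = bottleneck


print(discover_path_mtu([1500, 1492, 1400, 1500], 1500))
```

Note that this only models datagrams that fit into a single packet at the
sender. It does not cover the case from the bug report, where the sender
itself emits fragments with IP_DF set.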

If on the other hand the IP_DF bit is not set, any host on the route
from A to B is allowed to refragment the packet to fit the MTU of the
next connection.

Now I can't see any reason why iptables should not be allowed to
reassemble and refragment a packet with IP_DF not set.

>> The problem is that we _need_ to defragment at NF_IP_PRE_ROUTING in
>> order to have the be able to do connection tracking.  So at this point
>> we would need to save the sizes of all individual fragments.  This
>> would enable us to re-fragment to exactly the same size at
>> POST_ROUTING. 

Do we really have to re-fragment to exactly the same size? Wouldn't it
be sufficient to re-fragment to fragments not bigger in size than the
biggest incoming fragment of this connection?

>> And then, what happens if NAT has to resize (enlarge/shrink) a packet.
>> How should we deal with this while re-fragmenting? 

In my opinion we should just refragment it, as any router would do it.

> And if we go for my first propsal, how/where would we store the
> list-of-fragment-sizes?  We certainly don't want it to be dynamically
> allocated... but according to RFC791 there kan be 8192 fragments of 8
> octets each...

I think we have to store fragment sizes of each connection, but storing
the maximum fragment size should be enough. Anyhow, iptables can do lots
and lots of mangling with any packet, so what good is it to stick
with the original fragments?

As a final comment: What are overlapping fragments good for, except
trying to fool NIDSs? As I could not find any comment on how to deal
with those in the RFCs, we will not break an RFC by "fixing" the
fragments.

Oh, allow me another comment: No matter how you are going to fix the
problem, please don't try to fix problems with programs, but rather
stick to the RFCs as closely as possible. But well, that's probably what
you have been doing for years now, so forget this comment. ;-)

Ciao!
    Thomas

-- 
"Those puny RFC's are all that separates us from animals."


* Re: [netfilter-core] Re: [RFC]: ip_conntrack breaks UDP PMTU
  2003-02-14 13:42 ` Patrick McHardy
  2003-02-14 14:55   ` Harald Welte
@ 2003-02-15 19:34   ` Jozsef Kadlecsik
  1 sibling, 0 replies; 10+ messages in thread
From: Jozsef Kadlecsik @ 2003-02-15 19:34 UTC
  To: netfilter-devel; +Cc: coreteam

> >>ip_conntrack defrags packets at PRE_ROUTING and LOCAL_OUT and
> >>refragments them at POST_ROUTING without careing about IP_DF. packets
> >>with IP_DF|IP_MF can be refragmented with a different size, so path
> >>mtu discovery is broken.  Linux nfs itself sends out packets with
> >>IP_DF|IP_MF.

What about storing the biggest fragment size of a packet at
defragmentation and refragmenting the packet with that size at
POST_ROUTING if the MTU is not smaller?
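
A rough sketch of this idea, assuming a 20-byte IP header without options
and sizes given as fragment payload lengths. The function name is
hypothetical, and returning None when the outgoing MTU is smaller is a
placeholder: the proposal leaves that case open.

```python
def refragment_sizes(total_payload, max_seen, out_mtu, header_len=20):
    """Payload sizes for refragmenting with the biggest size seen.

    total_payload: payload length of the reassembled datagram.
    max_seen: biggest fragment payload recorded at defragmentation.
    Returns None when the outgoing MTU cannot carry max_seen.
    """
    # Largest payload the outgoing link carries, rounded down to 8 octets.
    link_limit = (out_mtu - header_len) // 8 * 8
    if max_seen > link_limit:
        return None                 # MTU is smaller: not covered here
    # Non-final fragments must carry a multiple of 8 octets.
    chunk = max_seen // 8 * 8
    if chunk == 0:
        raise ValueError("max_seen must be at least 8 octets")
    sizes = []
    remaining = total_payload
    while remaining > chunk:
        sizes.append(chunk)
        remaining -= chunk
    sizes.append(remaining)
    return sizes


print(refragment_sizes(5000, 1480, 1500))
print(refragment_sizes(5000, 1480, 9000))
```

The second call shows the point of the proposal: even with a much larger
outgoing MTU the fragments do not grow beyond what was received, so the
sender's view of the path MTU is not invalidated.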

Regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary


* Re: [RFC]: ip_conntrack breaks UDP PMTU
  2003-02-15 17:58 ` Thomas Poehnitzsch
@ 2003-02-15 20:50   ` Patrick McHardy
  2003-02-16 23:55     ` Thomas Poehnitzsch
  2003-02-16 19:54   ` Harald Welte
  1 sibling, 1 reply; 10+ messages in thread
From: Patrick McHardy @ 2003-02-15 20:50 UTC
  To: Thomas Poehnitzsch; +Cc: netfilter developer mailinglist

Thomas Poehnitzsch wrote:

>>>ip_conntrack defrags packets at PRE_ROUTING and LOCAL_OUT and
>>>refragments them at POST_ROUTING without careing about IP_DF. packets
>>>      
>>>
>
>What has IP_DF (I hope you mean the "Don't Fragment" bit in the
>IP-header) to do with _de_-fragmentation? As far as I understood RFC791
>
nothing.

>Could somebody please explain the notation: "IP_DF|IP_MF" to me? Does
>this mean at least one of both flags is set? And if so this is against
>my understanding of the above mentioned RFC791. If IP_DF is set, the
>packet _must not_ be fragmented, so IP_MF can't be set.
>
both are set. "|" is bitwise or. nfs (always?) generates packets bigger
than mtu, so they are fragmented and have IP_MF set (except the last one).
If linux wants to know the path mtu it sets IP_DF on these, so the
fragments may not be _further_ fragmented.
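
For illustration, this is how the two flags share the 16-bit field that
also carries the fragment offset. The mask values match the RFC 791 header
layout (and the IP_DF, IP_MF and IP_OFFSET constants in the Linux
headers); describe_frag_off() itself is a hypothetical helper, and the
values are shown in host byte order.

```python
IP_DF = 0x4000      # Don't Fragment
IP_MF = 0x2000      # More Fragments
IP_OFFSET = 0x1FFF  # fragment offset, in units of 8 octets


def describe_frag_off(frag_off):
    """Decode the flags/offset field of an IPv4 header."""
    return {
        "df": bool(frag_off & IP_DF),
        "mf": bool(frag_off & IP_MF),
        "offset_bytes": (frag_off & IP_OFFSET) * 8,
    }


# A 4440-byte payload sent as three fragments with DF set on each:
first = IP_DF | IP_MF          # offset 0, more fragments follow
middle = IP_DF | IP_MF | 185   # offset 185 * 8 = 1480
last = IP_DF | 370             # offset 2960, MF clear on the last one

for value in (first, middle, last):
    print(describe_frag_off(value))
```

So IP_DF|IP_MF on one packet is not a contradiction: MF says the packet is
a fragment of a bigger datagram, DF says it must not be split any further.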

>
>Let me go through an example of PMTUD and correct me if I am wrong with
>my view of this protocol:
>
>Assume host A wants to send a packet to host B and pmtud is enabled 
>(/proc/sys/net/ipv4/ip_no_pmtu_disc = 0) the IP_DF flag will be set in
>the packet sent. In case this packet will not pass through the eye of a
>needle further down the line, it will be droped and an ICMP Message
>(type 3 code 4: fragmentation needed and DF set) will be sent to A.
>Host A will then resend smaller packets (again with IP_DF set) until the
>packet reaches host B. 
>The way I understand it, there won't be any fragmented packet on the
>line in this connection, so iptables will not break anything.
>
>If on the other hand the IP_DF bit is not set, any host on the route
>from A to B is allowed to refragment the packet to fit the MTU of the
>next connection.
>
>Now I can't see any reason why iptables should not be allowed to
>reassemble and refragment a packet with IP_DF not set.
>

see above.

>
>  
>
>>>The problem is that we _need_ to defragment at NF_IP_PRE_ROUTING in
>>>order to have the be able to do connection tracking.  So at this point
>>>we would need to save the sizes of all individual fragments.  This
>>>would enable us to re-fragment to exactly the same size at
>>>POST_ROUTING. 
>>>      
>>>
>
>Do we really have to re-fragment to exactly the same size? Wouldn't it
>be sufficient to re-fragment to fragments not bigger in size than the
>biggest incoming fragment of this connection?
>
>  
>
>>>And then, what happens if NAT has to resize (enlarge/shrink) a packet.
>>>How should we deal with this while re-fragmenting? 
>>>      
>>>
>
>In my opinion we should just refragment it, as any router would do it.
>
that router is broken too. think about a host doing path mtu discovery: the
packet doesn't fit the interface mtu, but nat shrinks the packet so it does
fit .. the host gets a wrong idea of the pmtu. unfortunately i don't know of
a way to fix it, except maybe to also consider the removed bytes when
deciding whether a packet needs to be fragmented.
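
A minimal sketch of that check, assuming the pre-NAT length of the packet
is still available at the point of the decision (which is itself part of
the problem discussed here). The function name and parameters are
hypothetical.

```python
def needs_frag_icmp(original_len, mangled_len, out_mtu, df_set):
    """Decide whether a DF packet should trigger "fragmentation needed".

    original_len: packet length as received, before NAT mangling.
    mangled_len: packet length after NAT mangling.
    """
    # Bytes removed by NAT are counted as if they were still present, so
    # the sender does not learn a path MTU that is too large.
    removed = max(0, original_len - mangled_len)
    effective_len = mangled_len + removed
    return df_set and effective_len > out_mtu


# NAT shrank a 1500-byte packet to 1490 bytes; the outgoing MTU is 1492.
print(needs_frag_icmp(1500, 1490, 1492, df_set=True))
```

Checking only mangled_len would let this packet through, and the host
would wrongly conclude that 1500 bytes fit the path.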

>>And if we go for my first propsal, how/where would we store the
>>list-of-fragment-sizes?  We certainly don't want it to be dynamically
>>allocated... but according to RFC791 there kan be 8192 fragments of 8
>>octets each...
>>    
>>
>
>I think we have to store fragment sizes of each connection, but storing
>
even worse, we need to store the fragment sizes of each reassembled packet.
if we consider the case where not all fragments have DF set, and we want to
handle nat resizing correctly, then besides fragment sizes we also need
fragment boundaries and fragment flags (-> iph->frag_off).

Bye,
Patrick


* Re: [RFC]: ip_conntrack breaks UDP PMTU
  2003-02-15 17:58 ` Thomas Poehnitzsch
  2003-02-15 20:50   ` Patrick McHardy
@ 2003-02-16 19:54   ` Harald Welte
  1 sibling, 0 replies; 10+ messages in thread
From: Harald Welte @ 2003-02-16 19:54 UTC
  To: netfilter developer mailinglist

On Sat, Feb 15, 2003 at 06:58:41PM +0100, Thomas Poehnitzsch wrote:
> Hi,
> 
> I hope this discussion is not already over. Sorry, but it took me a
> while to understand all the implications and to skip through some RFC's.

[see my brand-new signature ;)

> On Fri, Feb 14, 2003 at 09:06:12AM +0100, Harald Welte wrote:
> > From https://bugzilla.netfilter.org/cgi-bin/bugzilla/show_bug.cgi?id=48
> 
> >> ip_conntrack defrags packets at PRE_ROUTING and LOCAL_OUT and
> >> refragments them at POST_ROUTING without careing about IP_DF. packets
> 
> What has IP_DF (I hope you mean the "Don't Fragment" bit in the
> IP-header) to do with _de_-fragmentation? As far as I understood RFC791
> ("Any internet datagram so marked is not to be internet fragmented under
> any circumstances.") this should never happen, and if so another host
> down the route has already f***ed up the packet. 
> I haven't tested whether iptables really ignores IP_DF in POST_ROUTING,
> but if so, this is a serious bug and should be fixed.

the problem is that at POST_ROUTING we no longer know what the original
packet size was... and thus don't know if the resulting fragments are
smaller than the fragments originally received at PRE_ROUTING.

> >> with IP_DF|IP_MF can be refragmented with a different size, so path
> >> mtu discovery is broken.  Linux nfs itself sends out packets with
> >> IP_DF|IP_MF.
> 
> Could somebody please explain the notation: "IP_DF|IP_MF" to me? Does
> this mean at least one of both flags is set? And if so this is against
> my understanding of the above mentioned RFC791. If IP_DF is set, the
> packet _must not_ be fragmented, so IP_MF can't be set.

I'm sorry, but I don't want to start describing how IP works.  I hope
this is no offence, but there are plenty of places on the net [and in
books] where you can get this information.

> >> The problem is that we _need_ to defragment at NF_IP_PRE_ROUTING in
> >> order to have the be able to do connection tracking.  So at this point
> >> we would need to save the sizes of all individual fragments.  This
> >> would enable us to re-fragment to exactly the same size at
> >> POST_ROUTING. 
> 
> Do we really have to re-fragment to exactly the same size? Wouldn't it
> be sufficient to re-fragment to fragments not bigger in size than the
> biggest incoming fragment of this connection?

Either we want to be transparent, or we don't want to.  At least the
current behaviour is well-documented and logical.  If we change this
code now, the fragment sizes should not differ (unless NAT did resize
packets, of course).

> 
> >> And then, what happens if NAT has to resize (enlarge/shrink) a packet.
> >> How should we deal with this while re-fragmenting? 
> 
> In my opinion we should just refragment it, as any router would do it.

routers don't do refragmentation. and it sucks to have a small fragment
in a TCP session that is otherwise using PMTU. But yes, I guess there is
no solution.

> As a final comment: What are overlapping fragments good for, except
> trying to fool NIDS's? As I could not find any comment on how to deal
> with those in the RFC's, we will not break an RFC by "fixing" the
> fragments.

I don't think there is any reasonable IDS that isn't able to do packet
reassembly in a correct way.  Most of the time, this happens after
grabbing the packets from a raw socket.

> Ciao!
>     Thomas

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie


* Re: [RFC]: ip_conntrack breaks UDP PMTU
  2003-02-15 20:50   ` Patrick McHardy
@ 2003-02-16 23:55     ` Thomas Poehnitzsch
  2003-02-17  0:39       ` Patrick McHardy
  0 siblings, 1 reply; 10+ messages in thread
From: Thomas Poehnitzsch @ 2003-02-16 23:55 UTC
  To: netfilter developer mailinglist

Hi Patrick,

thanks for enlightening me.

On Sat, Feb 15, 2003 at 09:50:59PM +0100, Patrick McHardy wrote:

> both are set. "|" is logical or. nfs (always?) generates packets bigger 
> than mtu

In my new understanding, this very much depends on the MTU and the size
of the NFS-operation that has to be sent in a single datagram.

> so they are fragmented and have IP_MF set (except last one). If linux 
> wants to
> know path mtu it sets IP_DF on these, so the fragments may not be _further_
> fragmented.

You are right, my understanding of PMTUD with UDP was slightly wrong.
So the problem is not unique to NFS, but can appear in any application
using UDP, with PMTUD enabled? It just needs to send indivisible
datagrams bigger than the smallest MTU on the route.

> that router is broken too. think about a host doing path mtu discovery, 
> the packet
> doesn't fit the interface mtu but nat shrinks the packet so it does fit 
> .. the host gets
> a wrong idea of the pmtu. unfortunately i don't know of a way to fix it 
> except maybe
> to also consider the removed bytes when deciding if a packet needs to be 
> fragmented.

I just skimmed through RFC1631, and have to admit I completely forgot
about NAT changing the packet size. And yes, your idea of considering
the original size if the packet size decreases by NAT seems to be a good
way. 

If, on the other hand, the packet size increases, a fragmentation
notification may confuse the application. In this case it would probably
be better to do the fragmentation based on the biggest fragment of the
datagram.

> even worse we need to store the fragment sizes of each reassembled 
> packet. if we consider
> the case not all fragments have DF set and we would want to handle nat 
> resizing correctly
> besides fragment sizes we also need fragment boundaries and fragment 
> flags (-> iph->frag_off).

But how to calculate the fragment boundaries after a nat-helper has
shrunken/enlarged the packet? Wouldn't this mean you have to let those
(fragment-)packets without the DF flag pass (fragmented if necessary)
and ask for fragmentation of those with DF set?
But with conntrack you have to choose an all or nothing approach. So how
do you ask for the retransmission of all packets/fragments?

Furthermore the ICMP error message may then contain data changed by NAT
and thus unknown to the application. (But I think somebody (you?) has
mentioned this before.)

To me this looks like a situation that cannot be handled properly
without breaking anything or making some assumptions. :-(

And what about overlapping fragments? The overlapping data might be
different after NAT.

Are you guys already through all the considerations and hacking it into
iptables?

Ciao!
    Thomas

PS: In my opinion NFS over UDP should only be used in LANs. When
    refragmentation or packet loss becomes a problem, NFS over TCP would
    probably be the better choice.

-- 
The key words "MUST", "MUST NOT", "DO", "DON'T", "REQUIRED", "SHALL",
"SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", "MAY BE"
and "OPTIONAL" in this document do not mean anything.
                                                         -- RFC 3251


* Re: [RFC]: ip_conntrack breaks UDP PMTU
  2003-02-16 23:55     ` Thomas Poehnitzsch
@ 2003-02-17  0:39       ` Patrick McHardy
  0 siblings, 0 replies; 10+ messages in thread
From: Patrick McHardy @ 2003-02-17  0:39 UTC
  To: Thomas Poehnitzsch; +Cc: netfilter developer mailinglist

Thomas Poehnitzsch wrote:

>Hi Patrick,
>
>thanks for enlightening me.
>
>On Sat, Feb 15, 2003 at 09:50:59PM +0100, Patrick McHardy wrote:
> 
>  
>
>>both are set. "|" is logical or. nfs (always?) generates packets bigger 
>>than mtu
>>    
>>
>
>In my new understanding, this very much depends on the MTU and the size
>of the NFS-operation that has to be sent in a single datagram.
>  
>

i read somewhere else nfs is unable to split up some operations over
multiple packets so it has to create bigger packets than the local
interface mtu (assuming ethernet mtu).

>>so they are fragmented and have IP_MF set (except last one). If linux 
>>wants to
>>know path mtu it sets IP_DF on these, so the fragments may not be _further_
>>fragmented.
>>    
>>
>
>You are right, my understanding of PMTUD with UDP was slightly wrong.
>So the problem is not unique to NFS, but can appear in any application
>using UDP, with PMTUD enabled? It just needs to send indivisible
>datagrams bigger than the smallest MTU on the route.
>

yes if conntrack is running at the place of the mtu transition.

>I just skimmed through RFC1631, and have to admit I completely forgot
>about NAT changing the packet size. And yes, your idea of considering
>the original size if the packet size decreases by NAT seems to be a good
>way. 
>
>If on the other hand the packet size increases a fragmentation
>notification may confuse the application. In this case it would probably
>be better to do the fragmentation based on the biggest fragment of the
>datagram.
>

this seems like a good idea.

>>even worse we need to store the fragment sizes of each reassembled 
>>packet. if we consider
>>the case not all fragments have DF set and we would want to handle nat 
>>resizing correctly
>>besides fragment sizes we also need fragment boundaries and fragment 
>>flags (-> iph->frag_off).
>>    
>>
>
>But how to calculate the fragment boundaries after a nat-helper has
>shrunken/enlarged the packet? Wouldn't this mean you have to let those
>(fragment-)packets without the DF flag pass (fragmented if necessary)
>and ask for fragmentation of those with DF set?
>
i think the important thing is to preserve fragment sizes. just handle all
new data as being added at the end and then do the fragmentation.
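
one possible reading of this, as a sketch: keep the original sizes of all
fragments except the last, and let only the tail absorb whatever nat added
or removed. the function name is hypothetical, and capping new tail
fragments at the biggest original size is an assumption borrowed from the
"biggest fragment" idea earlier in the thread.

```python
def relayout(orig_sizes, new_total):
    """Fragment payload sizes after NAT changed the datagram length.

    orig_sizes: payload sizes of the fragments as received.
    new_total: payload length of the reassembled datagram after NAT.
    """
    # New fragments never exceed the biggest one seen (multiple of 8).
    limit = max(orig_sizes) // 8 * 8
    if limit == 0:
        raise ValueError("largest fragment must carry at least 8 octets")
    sizes = []
    remaining = new_total
    # Keep the original layout for as long as the data lasts.
    for size in orig_sizes[:-1]:
        if remaining <= size:
            break
        sizes.append(size)
        remaining -= size
    # Treat size changes as if they happened at the end of the datagram.
    while remaining > limit:
        sizes.append(limit)
        remaining -= limit
    sizes.append(remaining)
    return sizes


orig = [1480, 1480, 1480, 560]
print(relayout(orig, 5010))   # grew by 10 bytes
print(relayout(orig, 6000))   # grew past the biggest fragment size
print(relayout(orig, 4000))   # shrank by 1000 bytes
```

this does not deal with overlapping fragments or with per-fragment flags,
which would need the extra state mentioned before.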

>But with conntrack you have to choose an all or nothing approach. So how
>do you ask for the retransmission of all packets/fragments?
>
upper layer protocols get the reassembled packet, so there is no way to
request retransmission of single fragments. besides that, the sender might
not even know that fragmentation happened.

>Furthermore the ICMP error message may then contain data changed by NAT
>and thus unknown to the application. (But I think somebody (you?) has
>mentioned this before.)
>
yes it was mentioned a number of times. i don't think any os out there
tries to pass fragmentation required messages to an application, but i
don't know ...

>To me this looks like a situation that cannot be handled properly
>without breaking anything or making some assumptions. :-(
>
>And what about overlapping fragments? The overlapping data might be
>be different after NAT.
>
i think fragmentation as seen during normal communication should not be too
hard to handle. the problems are overlapping fragments and many small,
differently sized fragments. also, normal linux defragmentation, which is
used atm, eats the ip headers of the single fragments (except the first)
during reassembly. these may contain options which also enlarge the packet,
so maybe nop option padding or something like this has to be done (which
could turn out to be useful for packets shrunk by nat ;))

bye,
patrick
