* PMTU issues due to TOS field manipulation (for DSCP)
@ 2003-12-10 18:23 Kevin W. Rudd
2003-12-10 19:34 ` Andi Kleen
0 siblings, 1 reply; 31+ messages in thread
From: Kevin W. Rudd @ 2003-12-10 18:23 UTC (permalink / raw)
To: David Miller, Alexey Kuznetsov; +Cc: netdev
Here is a recent posting to linux-kernel related to PMTU issues when
routers use DSCP marking.
> Subject Yet another UDP pmtud iss, it's different, really
> Date Sun, 16 Nov 2003 22:25:08 -0800
> From "Johnson, Chester F" <chester.f.johnson@intel.com>
>
> This is not the same as the pmtud issues discussed ad-nauseum from 1999
> through 2001. It really is different. Trust me, please read on.
>
> Well, it is similar, but with a twist. We are in the middle of deploying
> DiffServ compliant QoS throughout our networks and stumbled across an
> issue that occurs when we configure our routers to mark the DiffServ
> Code Points (DSCP) for UDP traffic (AFS, NFS, other full frame size UDP
> traffic).
>
> The problem is that when the marked traffic reaches an IPsec/Ethernet
> segment, and the DF bit set to true, an ICMP message is returned to the
> transmitting host to say basically "fix your MTU". Since we have changed
> the ToS field with DSCP information, the ICMP message no longer matches
> anything in the route cache hash. If the ToS field is not "0", it must
> match src, dst, and ToS in the cache. Well, we changed one of them and
> there can be no such match.
>
> The net result is that the transmitting host sends another 1500 byte
> packet and the process repeats itself. Ultimately the data transfer
> fails. When we stop DSCP marking, MTU negotiation works just fine, but
> we have no QoS.
>
> This kind of match might be great if we use a Linux platform as a
> router. It may indeed be useful for higher performance DiffServ routing.
> This kind of match requirement for an end-host is problematic. In our
> estimation it looks like a bug.
>
> Can anyone out there help sort this out?
>
> Chester Johnson
> Network Transport Engineering
> Intel Corporation
>
At least in the case of a "Destination unreachable/Fragmentation
needed" ICMP message, there is an assumption that the TOS value
returned will not have changed. The ip_rt_frag_needed() routine
will fail to find a cached route with a matching TOS (since it has
been changed by the DSCP marking) and so the MTU will not be properly
updated. Given that the DS field definition supersedes the previous
TOS definitions, this does indeed present a problem for route modifying
ICMP messages in a network environment that is using DSCP marking.
This particular user would like the ability to turn off the caching of
the TOS value within the routing tables. On the surface, it looks like
simply manipulating the IPTOS_RT_MASK would accomplish what they are
looking for with minimal code changes. This can currently be done with
something equivalent in include/net/route.h:
#ifndef CONFIG_IP_ROUTE_TOS
#define IPTOS_RT_MASK 0
#endif
and then rebuilding the kernel with the CONFIG_IP_ROUTE_TOS unset. If
the approach of zeroing out the TOS routing mask seems reasonable, a
more dynamic approach would be desired (a sysctl variable that can be
modified without having to build a custom kernel).
Long term, does it really make sense to continue trying to make routing
decisions based on TOS when this field is obsolete?
Thoughts? Comments?
Thanks,
-Kevin
--
Kevin W. Rudd
Linux Change Team
IBM Global Services
1-800-426-7378, T/L 775-4161
^ permalink raw reply [flat|nested] 31+ messages in thread* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-10 18:23 PMTU issues due to TOS field manipulation (for DSCP) Kevin W. Rudd @ 2003-12-10 19:34 ` Andi Kleen 2003-12-10 20:20 ` Julian Anastasov ` (2 more replies) 0 siblings, 3 replies; 31+ messages in thread From: Andi Kleen @ 2003-12-10 19:34 UTC (permalink / raw) To: Kevin W. Rudd; +Cc: davem, kuznet, netdev, chester.f.johnson On Wed, 10 Dec 2003 10:23:15 -0800 (PST) "Kevin W. Rudd" <ruddk@us.ibm.com> wrote: > > Thoughts? Comments? I don't think the network users will be very happy if you require changing all end hosts this way. How about the following hack? For DF=1 packets the ipid field is useless. When you rewrite the TOS to DSCP save the old TOS in the ipid field. When you see an ICMP fragment required message with the right DSCP on the router restore the old TOS from the ipid field. One drawback is that there are a few VJ header compression servers that require changing ipid. But you can still handle these by not overwriting all bits of ipid so that it still changes (keep a few changing bits) -Andi ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-10 19:34 ` Andi Kleen @ 2003-12-10 20:20 ` Julian Anastasov 2003-12-10 20:33 ` Nivedita Singhvi 2003-12-10 20:26 ` Kevin W. Rudd 2003-12-10 20:30 ` Nivedita Singhvi 2 siblings, 1 reply; 31+ messages in thread From: Julian Anastasov @ 2003-12-10 20:20 UTC (permalink / raw) To: Andi Kleen; +Cc: Kevin W. Rudd, davem, kuznet, netdev, chester.f.johnson Hello, On Wed, 10 Dec 2003, Andi Kleen wrote: > I don't think the network users will be very happy if you require changing all > end hosts this way. How about the following hack? For DF=1 packets > the ipid field is useless. When you rewrite the TOS to DSCP save > the old TOS in the ipid field. When you see an ICMP fragment required > message with the right DSCP on the router restore the old TOS from the ipid > field. It can be an option: pmtu_promisc or whatever (default 0). We should avoid using TOS as hash key and if the var is 1 we should ignore matching the tos keys in ip_rt_frag_needed and ip_rt_redirect. By this way, all cache entries between saddr and daddr will be updated, no matter the TOS or even OIF keys. Comments? Regards -- Julian Anastasov <ja@ssi.bg> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-10 20:20 ` Julian Anastasov @ 2003-12-10 20:33 ` Nivedita Singhvi 2003-12-10 21:18 ` Julian Anastasov 0 siblings, 1 reply; 31+ messages in thread From: Nivedita Singhvi @ 2003-12-10 20:33 UTC (permalink / raw) To: Julian Anastasov Cc: Andi Kleen, Kevin W. Rudd, davem, kuznet, netdev, chester.f.johnson Julian Anastasov wrote: > It can be an option: pmtu_promisc or whatever (default 0). > We should avoid using TOS as hash key and if the var is 1 we should > ignore matching the tos keys in ip_rt_frag_needed and ip_rt_redirect. > By this way, all cache entries between saddr and daddr will be updated, no > matter the TOS or even OIF keys. Comments? This is what I was going to suggest too, but one of the drawbacks would be needing the user to recompile the distro kernel just for this..could we keep any fix to this dynamically tunable? thanks, Nivedita ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-10 20:33 ` Nivedita Singhvi @ 2003-12-10 21:18 ` Julian Anastasov 2003-12-10 22:36 ` Nivedita Singhvi 0 siblings, 1 reply; 31+ messages in thread From: Julian Anastasov @ 2003-12-10 21:18 UTC (permalink / raw) To: Nivedita Singhvi Cc: Andi Kleen, Kevin W. Rudd, davem, kuznet, netdev, chester.f.johnson Hello, On Wed, 10 Dec 2003, Nivedita Singhvi wrote: > Julian Anastasov wrote: > > > It can be an option: pmtu_promisc or whatever (default 0). > > We should avoid using TOS as hash key and if the var is 1 we should > > ignore matching the tos keys in ip_rt_frag_needed and ip_rt_redirect. > > By this way, all cache entries between saddr and daddr will be updated, no > > matter the TOS or even OIF keys. Comments? > > This is what I was going to suggest too, but one of the drawbacks > would be needing the user to recompile the distro kernel just > for this..could we keep any fix to this dynamically tunable? Yes, I mean sysctl var in /proc. For now, there are two needs: 1. update PMTU info ignoring TOS and OIF 2. Reduce the number of hash entries by ignoring the TOS, may be depending on CONFIG_IP_ROUTE_TOS, it would be preferred to be sysctl var but the real benefits depend on the TOS usage. I see three cases: 1. we do not route by TOS => we want to ignore TOS everywhere, if possible, sysctl var variant for IPTOS_RT_MASK=0 2. we route by TOS (IPTOS_RT_MASK!=0) but we like the idea to ignore TOS for PMTUD and ICMP redirects. 3. the current behavior, PMTUD does not work if TOS is changed, the routing cache keeps many entries with different TOS The question is how many proc entries are needed for these cases. It seems, we can cover all cases with one proc entry: to make IPTOS_RT_MASK a sysctl var. But there is still difference between cases 1 and 2, may be the var will have more than 2 values? > thanks, > Nivedita Regards -- Julian Anastasov <ja@ssi.bg> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-10 21:18 ` Julian Anastasov @ 2003-12-10 22:36 ` Nivedita Singhvi 2003-12-10 22:51 ` David S. Miller 0 siblings, 1 reply; 31+ messages in thread From: Nivedita Singhvi @ 2003-12-10 22:36 UTC (permalink / raw) To: Julian Anastasov Cc: Andi Kleen, Kevin W. Rudd, davem, kuznet, netdev, chester.f.johnson Julian Anastasov wrote: > I see three cases: > > 1. we do not route by TOS => we want to ignore TOS everywhere, if > possible, sysctl var variant for IPTOS_RT_MASK=0 > > 2. we route by TOS (IPTOS_RT_MASK!=0) but we like the idea to > ignore TOS for PMTUD and ICMP redirects. > 3. the current behavior, PMTUD does not work if TOS is changed, > the routing cache keeps many entries with different TOS > The question is how many proc entries are needed for these > cases. It seems, we can cover all cases with one proc entry: to > make IPTOS_RT_MASK a sysctl var. But there is still difference > between cases 1 and 2, may be the var will have more than 2 values? 1. we route by TOS - yes (no PMTUD/ICMP conflict) MASK = $mask - no (PMTUD/ICMP conflict) MASK = 0 2. we do not route by TOS at all MASK = 0 Unless the kernel is checking on the fly, you do not need to distinguish between 1b and 2, correct? (i.e.Since we are just expecting the user to set the sysctl variable). 1a is your case 3, current behaviour. thanks, Nivedita ' ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-10 22:36 ` Nivedita Singhvi @ 2003-12-10 22:51 ` David S. Miller 2003-12-10 23:15 ` Julian Anastasov 0 siblings, 1 reply; 31+ messages in thread From: David S. Miller @ 2003-12-10 22:51 UTC (permalink / raw) To: Nivedita Singhvi; +Cc: ja, ak, ruddk, kuznet, netdev, chester.f.johnson Here is my take on this, as far as Linux is concerned. I agree with the three behaviors proposed by Julian. However I have some slight trouble with the ignore-TOS-for- PMTU idea, implementation wise. Walking the routing hash table for each possible TOS value is going to be computationally expensive, and is inviting computational complexity DDoS attacks by bombing the machine with PMTU ICMP messages. That is the most obvious implementation, and I'm not saying there are not others. I just have no alternatives in mind right now :) But once that issue is resolved I'm more than happy to put a patch in which does this stuff. We even have been speaking about this in other threads on netdev wrt. Julian's patches. TOS is truly a value with only network local meaning and hops are going to modify the value on us. I'm actually surprised this is the first time the issue has been seriously hit. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-10 22:51 ` David S. Miller @ 2003-12-10 23:15 ` Julian Anastasov 2003-12-10 23:20 ` David S. Miller 0 siblings, 1 reply; 31+ messages in thread From: Julian Anastasov @ 2003-12-10 23:15 UTC (permalink / raw) To: David S. Miller Cc: Nivedita Singhvi, ak, ruddk, kuznet, netdev, chester.f.johnson Hello, On Wed, 10 Dec 2003, David S. Miller wrote: > Here is my take on this, as far as Linux is concerned. > > I agree with the three behaviors proposed by Julian. > However I have some slight trouble with the ignore-TOS-for- > PMTU idea, implementation wise. > > Walking the routing hash table for each possible TOS value > is going to be computationally expensive, and is inviting > computational complexity DDoS attacks by bombing the machine > with PMTU ICMP messages. What about not using TOS as hash key, then we will see all entries for same SADDR->DADDR but with different TOS values in same table row. I hope it will not hurt the jenkins hash too much but it is evident that we put all these entries with different TOS and OIF on same table row. It seems, there are no many users of OIF!=0 but if TOS is used as routing key we can see up to 8 entries with different TOS for same SADDR,DADDR. Of course, it looks difficult to walk 8 rows just to check all TOS variants, the common case is to see only one TOS value used. That is why I propose to eliminate the TOS as hash key and to walk one row. At first look, the risk of DoS is same, thanks to the random value. Regards -- Julian Anastasov <ja@ssi.bg> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-10 23:15 ` Julian Anastasov @ 2003-12-10 23:20 ` David S. Miller 2003-12-11 0:06 ` Julian Anastasov 0 siblings, 1 reply; 31+ messages in thread From: David S. Miller @ 2003-12-10 23:20 UTC (permalink / raw) To: Julian Anastasov; +Cc: niv, ak, ruddk, kuznet, netdev, chester.f.johnson On Thu, 11 Dec 2003 01:15:06 +0200 (EET) Julian Anastasov <ja@ssi.bg> wrote: > It seems, there are no many users of OIF!=0 but if TOS is > used as routing key we can see up to 8 entries with different > TOS for same SADDR,DADDR. Of course, it looks difficult to > walk 8 rows just to check all TOS variants, the common case > is to see only one TOS value used. That is why I propose > to eliminate the TOS as hash key and to walk one row. At first > look, the risk of DoS is same, thanks to the random value. This is not my understanding at all. Consider the case where we generate routing cache entries for all 8 TOS values. Currently we'll likely get a O(1) lookup for any one of those entries. Your proposal guarentees that all such entries will land to the same hash chain, since TOS is not an input for the hash any longer. Therefore the lookup in my example case will be O(8). And instead of just eating the complexity at ICMP PMTU handling time, we eat the complexity at every routing cache lookup. I really don't think we can consider this. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-10 23:20 ` David S. Miller @ 2003-12-11 0:06 ` Julian Anastasov 2003-12-11 0:09 ` David S. Miller 0 siblings, 1 reply; 31+ messages in thread From: Julian Anastasov @ 2003-12-11 0:06 UTC (permalink / raw) To: David S. Miller; +Cc: niv, ak, ruddk, kuznet, netdev, chester.f.johnson Hello, On Wed, 10 Dec 2003, David S. Miller wrote: > > is to see only one TOS value used. That is why I propose > > to eliminate the TOS as hash key and to walk one row. At first > > look, the risk of DoS is same, thanks to the random value. > > This is not my understanding at all. > > Consider the case where we generate routing cache entries for > all 8 TOS values. Currently we'll likely get a O(1) lookup > for any one of those entries. IMO, the main question is whether all chains are with ~equal length, this depends on how good is the hash function. And we hope it is good. Because these entries are always linked somewhere and if they do not slowdown one chain they surely slowdown other chains. The total walk time is always same if all entries are searched. This is more and more true if the average chain length is longer than the number of TOS values used. Of course, if the table is empty and the average chain length is 1-4 using TOS as hash key will be faster for 8 TOS values. But the common average chain length for full table is 16 and the different TOS values are usually 1 or 2 per path. We do not talk about intentional DoS. > Your proposal guarentees that all such entries will land to the > same hash chain, since TOS is not an input for the hash any longer. > Therefore the lookup in my example case will be O(8). Yes, it will look very slow on empty table and when we do not ignore the TOS values. > And instead of just eating the complexity at ICMP PMTU handling > time, we eat the complexity at every routing cache lookup. For solving the PMTUD problems, I still do not know how we can avoid walking 8 chains if the TOS is a hash key. This is a real DoS because 8 is too longer compared with the common case of walking 1 chain with 1-2 TOS values. BTW, there are 1-2 entries for OIF too. Regards -- Julian Anastasov <ja@ssi.bg> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-11 0:06 ` Julian Anastasov @ 2003-12-11 0:09 ` David S. Miller 2003-12-11 0:34 ` Julian Anastasov 0 siblings, 1 reply; 31+ messages in thread From: David S. Miller @ 2003-12-11 0:09 UTC (permalink / raw) To: Julian Anastasov; +Cc: niv, ak, ruddk, kuznet, netdev, chester.f.johnson On Thu, 11 Dec 2003 02:06:09 +0200 (EET) Julian Anastasov <ja@ssi.bg> wrote: > But the common average chain length for full table is 16 and the > different TOS values are usually 1 or 2 per path. We do not talk > about intentional DoS. Your system has not configured it's rthash table size properly, see recent discussions Robert has been making here. You should have rthash sized somewhere near the number of entries a full system will have. But regardless, let us say that your system has complexity O(16) lookups as you mention, your proposal changes this to O(16+8). ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-11 0:09 ` David S. Miller @ 2003-12-11 0:34 ` Julian Anastasov 2003-12-12 8:31 ` David S. Miller 0 siblings, 1 reply; 31+ messages in thread From: Julian Anastasov @ 2003-12-11 0:34 UTC (permalink / raw) To: David S. Miller; +Cc: niv, ak, ruddk, kuznet, netdev, chester.f.johnson Hello, On Wed, 10 Dec 2003, David S. Miller wrote: > On Thu, 11 Dec 2003 02:06:09 +0200 (EET) > Julian Anastasov <ja@ssi.bg> wrote: > > > But the common average chain length for full table is 16 and the > > different TOS values are usually 1 or 2 per path. We do not talk > > about intentional DoS. > > Your system has not configured it's rthash table size properly, see > recent discussions Robert has been making here. You should have > rthash sized somewhere near the number of entries a full system > will have. > > But regardless, let us say that your system has complexity O(16) > lookups as you mention, your proposal changes this to O(16+8). It is ~16 :) ip_rt_max_size = (rt_hash_mask + 1) * 16; This is what happens on full table, of course. OK, some simple numbers for an ideal table: - full table with 1024 chains, 16384 (max_size) entries equally distributed in these 1024 chains, 16 per chain. - there are 2048 paths with same saddr->daddr, each has 8 TOS values 2 cases depending on whether TOS is a hash key (path=saddr->daddr): 1. TOS is a hash key: - in each chain we have 16 paths, 1 TOS value per path - all 8 TOS values for a path are in 8 different chains 2. TOS is not a hash key: 2 paths per chain (2 paths x 8 TOS values => 16 entries) if all saddr->daddr->tos streams have same packet rate I think the CPU time to lookup them will be same. This is because 8 (number of TOS values) < 16 (chain length). And I hope the users always can tune the proposed TOS settings if they see DoS and if they do not need TOS as a rt key. Regards -- Julian Anastasov <ja@ssi.bg> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-11 0:34 ` Julian Anastasov @ 2003-12-12 8:31 ` David S. Miller 2003-12-12 23:38 ` Julian Anastasov 2003-12-13 0:10 ` Julian Anastasov 0 siblings, 2 replies; 31+ messages in thread From: David S. Miller @ 2003-12-12 8:31 UTC (permalink / raw) To: Julian Anastasov; +Cc: niv, ak, ruddk, kuznet, netdev, chester.f.johnson On Thu, 11 Dec 2003 02:34:51 +0200 (EET) Julian Anastasov <ja@ssi.bg> wrote: > On Wed, 10 Dec 2003, David S. Miller wrote: > > > But regardless, let us say that your system has complexity O(16) > > lookups as you mention, your proposal changes this to O(16+8). > > It is ~16 :) > > ip_rt_max_size = (rt_hash_mask + 1) * 16; > > This is what happens on full table, of course. OK, > some simple numbers for an ideal table: But look at default gc_thresh setting, which is when we trim rt cache entries: ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1); The ip_rt_max_size value is meant to be a sort of buffer to absorb the situation where many rt cache entries are unreclaimable. But this is a seperate issue, and we can discuss your further points regardless. > 2 cases depending on whether TOS is a hash key (path=saddr->daddr): > > 1. TOS is a hash key: > > - in each chain we have 16 paths, 1 TOS value per path > - all 8 TOS values for a path are in 8 different chains > > 2. TOS is not a hash key: > > 2 paths per chain (2 paths x 8 TOS values => 16 entries) > > if all saddr->daddr->tos streams have same packet rate I think > the CPU time to lookup them will be same. > This is because 8 (number of TOS values) < 16 (chain length). > > And I hope the users always can tune the proposed TOS > settings if they see DoS and if they do not need TOS as a rt key. Ok. I agree with your analysis. Let's propose something concrete. 1) PMTU processing applies PMTU change to all TOS'd instances of a route. This behavior change is sysctl controllable, and on by default. The implementation is to just lookup all 8 possible TOS values. 2) Whether TOS is a routing cache hash key is controlled by another sysctl. When CONFIG_IP_ROUTE_TOS is set this sysctl defaults to on, other- wise it defaults to off. I think #2 should be very safe because fib node fn_tos values are only ever set when that config variable is enabled, and fib rule r_tos values are only compared on lookup when it is enabled as well. However, there could be a few more ifdefs added to the fib rule code to cover all the assignment cases too but let's not worry about that right now. Comments? ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-12 8:31 ` David S. Miller @ 2003-12-12 23:38 ` Julian Anastasov 2003-12-18 23:17 ` Kevin W. Rudd 2004-01-19 22:43 ` Kevin W. Rudd 2003-12-13 0:10 ` Julian Anastasov 1 sibling, 2 replies; 31+ messages in thread From: Julian Anastasov @ 2003-12-12 23:38 UTC (permalink / raw) To: David S. Miller; +Cc: niv, ak, ruddk, kuznet, netdev, chester.f.johnson Hello, On Fri, 12 Dec 2003, David S. Miller wrote: > 1) PMTU processing applies PMTU change to all TOS'd instances of > a route. This behavior change is sysctl controllable, and > on by default. > > The implementation is to just lookup all 8 possible TOS values. > > 2) Whether TOS is a routing cache hash key is controlled by another > sysctl. > > When CONFIG_IP_ROUTE_TOS is set this sysctl defaults to on, other- > wise it defaults to off. > > I think #2 should be very safe because fib node fn_tos values are only > ever set when that config variable is enabled, and fib rule r_tos values > are only compared on lookup when it is enabled as well. However, there > could be a few more ifdefs added to the fib rule code to cover all the > assignment cases too but let's not worry about that right now. OK, here is a change that compiles. No fib changes yet (it is the next step). I'm not sure what var names are better. > Comments? What about ip_rt_redirect? May be we need the same TOS matching behavior if TOS is changed somewhere after routing? diff -Nru a/include/linux/sysctl.h b/include/linux/sysctl.h --- a/include/linux/sysctl.h Sat Dec 13 01:16:15 2003 +++ b/include/linux/sysctl.h Sat Dec 13 01:16:15 2003 @@ -329,6 +329,8 @@ NET_IPV4_ROUTE_MIN_PMTU=16, NET_IPV4_ROUTE_MIN_ADVMSS=17, NET_IPV4_ROUTE_SECRET_INTERVAL=18, + NET_IPV4_ROUTE_IGNORE_TOS=19, + NET_IPV4_ROUTE_PMTU_MODE=20, }; enum diff -Nru a/net/ipv4/route.c b/net/ipv4/route.c --- a/net/ipv4/route.c Sat Dec 13 01:16:15 2003 +++ b/net/ipv4/route.c Sat Dec 13 01:16:15 2003 @@ -124,6 +124,14 @@ int ip_rt_min_advmss = 256; int ip_rt_secret_interval = 10 * 60 * HZ; static unsigned long rt_deadline; +#ifdef CONFIG_IP_ROUTE_TOS +static u8 iptos_rt_mask = IPTOS_RT_MASK; +int ip_rt_ignore_tos; +#else +static u8 iptos_rt_mask; +int ip_rt_ignore_tos = 1; +#endif +int ip_rt_pmtu_mode; /* 1=match by iph->tos, 0=ignore TOS for PMTU */ #define RTprint(a...) printk(KERN_DEBUG a) @@ -967,13 +975,12 @@ void ip_rt_redirect(u32 old_gw, u32 daddr, u32 new_gw, u32 saddr, u8 tos, struct net_device *dev) { - int i, k; + int i; struct in_device *in_dev = in_dev_get(dev); struct rtable *rth, **rthp; u32 skeys[2] = { saddr, 0 }; - int ikeys[2] = { dev->ifindex, 0 }; - tos &= IPTOS_RT_MASK; + tos &= iptos_rt_mask; if (!in_dev) return; @@ -993,10 +1000,7 @@ } for (i = 0; i < 2; i++) { - for (k = 0; k < 2; k++) { - unsigned hash = rt_hash_code(daddr, - skeys[i] ^ (ikeys[k] << 5), - tos); + unsigned hash = rt_hash_code(daddr, skeys[i], tos); rthp=&rt_hash_table[hash].chain; @@ -1008,7 +1012,9 @@ if (rth->fl.fl4_dst != daddr || rth->fl.fl4_src != skeys[i] || rth->fl.fl4_tos != tos || - rth->fl.oif != ikeys[k] || + (rth->fl.oif && + rth->fl.oif != dev->ifindex) || + rth->rt_dst == rth->rt_gateway || rth->fl.iif != 0) { rthp = &rth->u.rt_next; continue; @@ -1075,7 +1081,6 @@ rcu_read_unlock(); do_next: ; - } } in_dev_put(in_dev); return; @@ -1105,8 +1110,7 @@ } else if ((rt->rt_flags & RTCF_REDIRECTED) || rt->u.dst.expires) { unsigned hash = rt_hash_code(rt->fl.fl4_dst, - rt->fl.fl4_src ^ - (rt->fl.oif << 5), + rt->fl.fl4_src, rt->fl.fl4_tos); #if RT_CACHE_DEBUG >= 1 printk(KERN_DEBUG "ip_rt_advice: redirect to " @@ -1237,21 +1241,33 @@ return 68; } +/* See IPTOS_RT_MASK */ +static u8 all_tos_values[8] = { 0x00, 0x04, 0x08, 0x0C, 0x10, 0x14, 0x18, 0x1C }; + unsigned short ip_rt_frag_needed(struct iphdr *iph, unsigned short new_mtu) { - int i; + int i, j, ntos; unsigned short old_mtu = ntohs(iph->tot_len); struct rtable *rth; u32 skeys[2] = { iph->saddr, 0, }; u32 daddr = iph->daddr; - u8 tos = iph->tos & IPTOS_RT_MASK; + u8 *tos_values, tos = iph->tos & iptos_rt_mask; unsigned short est_mtu = 0; if (ipv4_config.no_pmtu_disc) return 0; + if (ip_rt_pmtu_mode || !iptos_rt_mask) { + tos_values = &tos; + ntos = 1; + } else { + tos_values = all_tos_values; + ntos = ARRAY_SIZE(all_tos_values); + } + + for (j = 0; j < ntos; j++) for (i = 0; i < 2; i++) { - unsigned hash = rt_hash_code(daddr, skeys[i], tos); + unsigned hash = rt_hash_code(daddr, skeys[i], tos_values[j]); rcu_read_lock(); for (rth = rt_hash_table[hash].chain; rth; @@ -1259,9 +1275,9 @@ smp_read_barrier_depends(); if (rth->fl.fl4_dst == daddr && rth->fl.fl4_src == skeys[i] && + rth->fl.fl4_tos == tos_values[j] && rth->rt_dst == daddr && rth->rt_src == iph->saddr && - rth->fl.fl4_tos == tos && rth->fl.iif == 0 && !(dst_metric_locked(&rth->u.dst, RTAX_MTU))) { unsigned short mtu = new_mtu; @@ -1503,7 +1519,7 @@ RT_CACHE_STAT_INC(in_slow_mc); in_dev_put(in_dev); - hash = rt_hash_code(daddr, saddr ^ (dev->ifindex << 5), tos); + hash = rt_hash_code(daddr, saddr, tos); return rt_intern_hash(hash, rth, (struct rtable**) &skb->dst); e_nobufs: @@ -1554,7 +1570,7 @@ if (!in_dev) goto out; - hash = rt_hash_code(daddr, saddr ^ (fl.iif << 5), tos); + hash = rt_hash_code(daddr, saddr, tos); /* Check for the most weird martians, which can be not detected by fib_lookup. @@ -1846,8 +1862,8 @@ unsigned hash; int iif = dev->ifindex; - tos &= IPTOS_RT_MASK; - hash = rt_hash_code(daddr, saddr ^ (iif << 5), tos); + tos &= iptos_rt_mask; + hash = rt_hash_code(daddr, saddr, tos); rcu_read_lock(); for (rth = rt_hash_table[hash].chain; rth; rth = rth->u.rt_next) { @@ -1912,11 +1928,11 @@ int ip_route_output_slow(struct rtable **rp, const struct flowi *oldflp) { - u32 tos = oldflp->fl4_tos & (IPTOS_RT_MASK | RTO_ONLINK); + u32 tos = oldflp->fl4_tos & (iptos_rt_mask | RTO_ONLINK); struct flowi fl = { .nl_u = { .ip4_u = { .daddr = oldflp->fl4_dst, .saddr = oldflp->fl4_src, - .tos = tos & IPTOS_RT_MASK, + .tos = tos & iptos_rt_mask, .scope = ((tos & RTO_ONLINK) ? RT_SCOPE_LINK : RT_SCOPE_UNIVERSE), @@ -2190,7 +2206,7 @@ rth->rt_flags = flags; - hash = rt_hash_code(oldflp->fl4_dst, oldflp->fl4_src ^ (oldflp->oif << 5), tos); + hash = rt_hash_code(oldflp->fl4_dst, oldflp->fl4_src, tos); err = rt_intern_hash(hash, rth, rp); done: if (free_res) @@ -2213,8 +2229,9 @@ { unsigned hash; struct rtable *rth; + u8 tos = flp->fl4_tos & (iptos_rt_mask | RTO_ONLINK); - hash = rt_hash_code(flp->fl4_dst, flp->fl4_src ^ (flp->oif << 5), flp->fl4_tos); + hash = rt_hash_code(flp->fl4_dst, flp->fl4_src, tos); rcu_read_lock(); for (rth = rt_hash_table[hash].chain; rth; rth = rth->u.rt_next) { @@ -2226,8 +2243,7 @@ #ifdef CONFIG_IP_ROUTE_FWMARK rth->fl.fl4_fwmark == flp->fl4_fwmark && #endif - !((rth->fl.fl4_tos ^ flp->fl4_tos) & - (IPTOS_RT_MASK | RTO_ONLINK))) { + rth->fl.fl4_tos == tos) { rth->u.dst.lastuse = jiffies; dst_hold(&rth->u.dst); rth->u.dst.__use++; @@ -2479,6 +2495,26 @@ } #ifdef CONFIG_SYSCTL + +static int ipv4_sysctl_doint_strategy(ctl_table *table, int *name, + int nlen, void *oldval, + size_t *oldlenp, void *newval, + size_t newlen, void **context) +{ + int val; + + if (!newval || !newlen) + return 0; + if (newlen != sizeof(int)) + return -EINVAL; + if (get_user(val, (int *)newval)) + return -EFAULT; + if (val == *(int *) table->data) + return 0; + *(int *) table->data = val; + return 1; +} + static int flush_delay; static int ipv4_sysctl_rtcache_flush(ctl_table *ctl, int write, @@ -2508,6 +2544,53 @@ return 0; } +#ifdef CONFIG_IP_ROUTE_TOS + +static void do_ignore_tos(void) +{ + iptos_rt_mask = ip_rt_ignore_tos? 0 : IPTOS_RT_MASK; + rt_cache_flush(0); +} + +#endif + +static int ip_rt_ignore_tos_handler(ctl_table *ctl, int write, + struct file *filp, void *buffer, + size_t *lenp) +{ + if (write) { +#ifdef CONFIG_IP_ROUTE_TOS + int old = ip_rt_ignore_tos; + int ret = proc_dointvec(ctl, write, filp, buffer, lenp); + + if (ret) + return ret; + if (old != ip_rt_ignore_tos) do_ignore_tos(); + return 0; +#else + return -EINVAL; +#endif + } + return proc_dointvec(ctl, write, filp, buffer, lenp); +} +static int ipv4_sysctl_ignore_tos_strategy(ctl_table *table, int *name, + int nlen, void *oldval, + size_t *oldlenp, void *newval, + size_t newlen, void **context) +{ +#ifdef CONFIG_IP_ROUTE_TOS + int ret = ipv4_sysctl_doint_strategy(table, name, nlen, oldval, oldlenp, + newval, newlen, context); + + if (1 != ret) + return ret; + do_ignore_tos(); + return 0; +#else + return (newval || newlen) ? -EINVAL : 0; +#endif +} + ctl_table ipv4_route_table[] = { { .ctl_name = NET_IPV4_ROUTE_FLUSH, @@ -2660,6 +2743,23 @@ .mode = 0644, .proc_handler = &proc_dointvec_jiffies, .strategy = &sysctl_jiffies, + }, + { + .ctl_name = NET_IPV4_ROUTE_IGNORE_TOS, + .procname = "ignore_tos", + .data = &ip_rt_ignore_tos, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &ip_rt_ignore_tos_handler, + .strategy = &ipv4_sysctl_ignore_tos_strategy, + }, + { + .ctl_name = NET_IPV4_ROUTE_PMTU_MODE, + .procname = "pmtu_mode", + .data = &ip_rt_pmtu_mode, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, }, { .ctl_name = 0 } }; Regards -- Julian Anastasov <ja@ssi.bg> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-12 23:38 ` Julian Anastasov @ 2003-12-18 23:17 ` Kevin W. Rudd 2004-01-19 22:43 ` Kevin W. Rudd 1 sibling, 0 replies; 31+ messages in thread From: Kevin W. Rudd @ 2003-12-18 23:17 UTC (permalink / raw) To: Julian Anastasov Cc: David S. Miller, niv, ak, kuznet, netdev, chester.f.johnson On Sat, 13 Dec 2003, Julian Anastasov wrote: > Subject: Re: PMTU issues due to TOS field manipulation (for DSCP) > > OK, here is a change that compiles. No fib changes yet (it > is the next step). I'm not sure what var names are better. > The changes look good from an operational point of view. Both the default pmtu_mode approach and setting ignore_tos resolved the problem on my test system. > What about ip_rt_redirect? May be we need the same TOS > matching behavior if TOS is changed somewhere after routing? I would think that redirects would suffer from the same problem, but I haven't directly tested that scenario myself. I think it would be prudent to take the same approach in this area (at least have ignore_tos honored). Other than my normal daily interrupts, one of the reasons it took me so long to play with your proposed patch was that I had to get a 2.6 kernel up and running on my test system. The original customer complaint we were following up on (and all of my prior testing) was against a 2.4 kernel. We would like to see these same changes rolled into the 2.4 train also if at all possible. Thanks, -Kevin > > diff -Nru a/include/linux/sysctl.h b/include/linux/sysctl.h > --- a/include/linux/sysctl.h Sat Dec 13 01:16:15 2003 > +++ b/include/linux/sysctl.h Sat Dec 13 01:16:15 2003 > @@ -329,6 +329,8 @@ > NET_IPV4_ROUTE_MIN_PMTU=16, > NET_IPV4_ROUTE_MIN_ADVMSS=17, > NET_IPV4_ROUTE_SECRET_INTERVAL=18, > + NET_IPV4_ROUTE_IGNORE_TOS=19, > + NET_IPV4_ROUTE_PMTU_MODE=20, > }; > > enum > diff -Nru a/net/ipv4/route.c b/net/ipv4/route.c > --- a/net/ipv4/route.c Sat Dec 13 01:16:15 2003 > +++ b/net/ipv4/route.c Sat Dec 13 01:16:15 2003 > @@ -124,6 +124,14 @@ > int ip_rt_min_advmss = 256; > int ip_rt_secret_interval = 10 * 60 * HZ; > static unsigned long rt_deadline; > +#ifdef CONFIG_IP_ROUTE_TOS > +static u8 iptos_rt_mask = IPTOS_RT_MASK; > +int ip_rt_ignore_tos; > +#else > +static u8 iptos_rt_mask; > +int ip_rt_ignore_tos = 1; > +#endif > +int ip_rt_pmtu_mode; /* 1=match by iph->tos, 0=ignore TOS for PMTU */ > > #define RTprint(a...) printk(KERN_DEBUG a) > > @@ -967,13 +975,12 @@ > void ip_rt_redirect(u32 old_gw, u32 daddr, u32 new_gw, > u32 saddr, u8 tos, struct net_device *dev) > { > - int i, k; > + int i; > struct in_device *in_dev = in_dev_get(dev); > struct rtable *rth, **rthp; > u32 skeys[2] = { saddr, 0 }; > - int ikeys[2] = { dev->ifindex, 0 }; > > - tos &= IPTOS_RT_MASK; > + tos &= iptos_rt_mask; > > if (!in_dev) > return; > @@ -993,10 +1000,7 @@ > } > > for (i = 0; i < 2; i++) { > - for (k = 0; k < 2; k++) { > - unsigned hash = rt_hash_code(daddr, > - skeys[i] ^ (ikeys[k] << 5), > - tos); > + unsigned hash = rt_hash_code(daddr, skeys[i], tos); > > rthp=&rt_hash_table[hash].chain; > > @@ -1008,7 +1012,9 @@ > if (rth->fl.fl4_dst != daddr || > rth->fl.fl4_src != skeys[i] || > rth->fl.fl4_tos != tos || > - rth->fl.oif != ikeys[k] || > + (rth->fl.oif && > + rth->fl.oif != dev->ifindex) || > + rth->rt_dst == rth->rt_gateway || > rth->fl.iif != 0) { > rthp = &rth->u.rt_next; > continue; > @@ -1075,7 +1081,6 @@ > rcu_read_unlock(); > do_next: > ; > - } > } > in_dev_put(in_dev); > return; > @@ -1105,8 +1110,7 @@ > } else if ((rt->rt_flags & RTCF_REDIRECTED) || > rt->u.dst.expires) { > unsigned hash = rt_hash_code(rt->fl.fl4_dst, > - rt->fl.fl4_src ^ > - (rt->fl.oif << 5), > + rt->fl.fl4_src, > rt->fl.fl4_tos); > #if RT_CACHE_DEBUG >= 1 > printk(KERN_DEBUG "ip_rt_advice: redirect to " > @@ -1237,21 +1241,33 @@ > return 68; > } > > +/* See IPTOS_RT_MASK */ > +static u8 all_tos_values[8] = { 0x00, 0x04, 0x08, 0x0C, 0x10, 0x14, 0x18, 0x1C }; > + > unsigned short ip_rt_frag_needed(struct iphdr *iph, unsigned short new_mtu) > { > - int i; > + int i, j, ntos; > unsigned short old_mtu = ntohs(iph->tot_len); > struct rtable *rth; > u32 skeys[2] = { iph->saddr, 0, }; > u32 daddr = iph->daddr; > - u8 tos = iph->tos & IPTOS_RT_MASK; > + u8 *tos_values, tos = iph->tos & iptos_rt_mask; > unsigned short est_mtu = 0; > > if (ipv4_config.no_pmtu_disc) > return 0; > > + if (ip_rt_pmtu_mode || !iptos_rt_mask) { > + tos_values = &tos; > + ntos = 1; > + } else { > + tos_values = all_tos_values; > + ntos = ARRAY_SIZE(all_tos_values); > + } > + > + for (j = 0; j < ntos; j++) > for (i = 0; i < 2; i++) { > - unsigned hash = rt_hash_code(daddr, skeys[i], tos); > + unsigned hash = rt_hash_code(daddr, skeys[i], tos_values[j]); > > rcu_read_lock(); > for (rth = rt_hash_table[hash].chain; rth; > @@ -1259,9 +1275,9 @@ > smp_read_barrier_depends(); > if (rth->fl.fl4_dst == daddr && > rth->fl.fl4_src == skeys[i] && > + rth->fl.fl4_tos == tos_values[j] && > rth->rt_dst == daddr && > rth->rt_src == iph->saddr && > - rth->fl.fl4_tos == tos && > rth->fl.iif == 0 && > !(dst_metric_locked(&rth->u.dst, RTAX_MTU))) { > unsigned short mtu = new_mtu; > @@ -1503,7 +1519,7 @@ > RT_CACHE_STAT_INC(in_slow_mc); > > in_dev_put(in_dev); > - hash = rt_hash_code(daddr, saddr ^ (dev->ifindex << 5), tos); > + hash = rt_hash_code(daddr, saddr, tos); > return rt_intern_hash(hash, rth, (struct rtable**) &skb->dst); > > e_nobufs: > @@ -1554,7 +1570,7 @@ > if (!in_dev) > goto out; > > - hash = rt_hash_code(daddr, saddr ^ (fl.iif << 5), tos); > + hash = rt_hash_code(daddr, saddr, tos); > > /* Check for the most weird martians, which can be not detected > by fib_lookup. > @@ -1846,8 +1862,8 @@ > unsigned hash; > int iif = dev->ifindex; > > - tos &= IPTOS_RT_MASK; > - hash = rt_hash_code(daddr, saddr ^ (iif << 5), tos); > + tos &= iptos_rt_mask; > + hash = rt_hash_code(daddr, saddr, tos); > > rcu_read_lock(); > for (rth = rt_hash_table[hash].chain; rth; rth = rth->u.rt_next) { > @@ -1912,11 +1928,11 @@ > > int ip_route_output_slow(struct rtable **rp, const struct flowi *oldflp) > { > - u32 tos = oldflp->fl4_tos & (IPTOS_RT_MASK | RTO_ONLINK); > + u32 tos = oldflp->fl4_tos & (iptos_rt_mask | RTO_ONLINK); > struct flowi fl = { .nl_u = { .ip4_u = > { .daddr = oldflp->fl4_dst, > .saddr = oldflp->fl4_src, > - .tos = tos & IPTOS_RT_MASK, > + .tos = tos & iptos_rt_mask, > .scope = ((tos & RTO_ONLINK) ? > RT_SCOPE_LINK : > RT_SCOPE_UNIVERSE), > @@ -2190,7 +2206,7 @@ > > rth->rt_flags = flags; > > - hash = rt_hash_code(oldflp->fl4_dst, oldflp->fl4_src ^ (oldflp->oif << 5), tos); > + hash = rt_hash_code(oldflp->fl4_dst, oldflp->fl4_src, tos); > err = rt_intern_hash(hash, rth, rp); > done: > if (free_res) > @@ -2213,8 +2229,9 @@ > { > unsigned hash; > struct rtable *rth; > + u8 tos = flp->fl4_tos & (iptos_rt_mask | RTO_ONLINK); > > - hash = rt_hash_code(flp->fl4_dst, flp->fl4_src ^ (flp->oif << 5), flp->fl4_tos); > + hash = rt_hash_code(flp->fl4_dst, flp->fl4_src, tos); > > rcu_read_lock(); > for (rth = rt_hash_table[hash].chain; rth; rth = rth->u.rt_next) { > @@ -2226,8 +2243,7 @@ > #ifdef CONFIG_IP_ROUTE_FWMARK > rth->fl.fl4_fwmark == flp->fl4_fwmark && > #endif > - !((rth->fl.fl4_tos ^ flp->fl4_tos) & > - (IPTOS_RT_MASK | RTO_ONLINK))) { > + rth->fl.fl4_tos == tos) { > rth->u.dst.lastuse = jiffies; > dst_hold(&rth->u.dst); > rth->u.dst.__use++; > @@ -2479,6 +2495,26 @@ > } > > #ifdef CONFIG_SYSCTL > + > +static int ipv4_sysctl_doint_strategy(ctl_table *table, int *name, > + int nlen, void *oldval, > + size_t *oldlenp, void *newval, > + size_t newlen, void **context) > +{ > + int val; > + > + if (!newval || !newlen) > + return 0; > + if (newlen != sizeof(int)) > + return -EINVAL; > + if (get_user(val, (int *)newval)) > + return -EFAULT; > + if (val == *(int *) table->data) > + return 0; > + *(int *) table->data = val; > + return 1; > +} > + > static int flush_delay; > > static int ipv4_sysctl_rtcache_flush(ctl_table *ctl, int write, > @@ -2508,6 +2544,53 @@ > return 0; > } > > +#ifdef CONFIG_IP_ROUTE_TOS > + > +static void do_ignore_tos(void) > +{ > + iptos_rt_mask = ip_rt_ignore_tos? 0 : IPTOS_RT_MASK; > + rt_cache_flush(0); > +} > + > +#endif > + > +static int ip_rt_ignore_tos_handler(ctl_table *ctl, int write, > + struct file *filp, void *buffer, > + size_t *lenp) > +{ > + if (write) { > +#ifdef CONFIG_IP_ROUTE_TOS > + int old = ip_rt_ignore_tos; > + int ret = proc_dointvec(ctl, write, filp, buffer, lenp); > + > + if (ret) > + return ret; > + if (old != ip_rt_ignore_tos) do_ignore_tos(); > + return 0; > +#else > + return -EINVAL; > +#endif > + } > + return proc_dointvec(ctl, write, filp, buffer, lenp); > +} > +static int ipv4_sysctl_ignore_tos_strategy(ctl_table *table, int *name, > + int nlen, void *oldval, > + size_t *oldlenp, void *newval, > + size_t newlen, void **context) > +{ > +#ifdef CONFIG_IP_ROUTE_TOS > + int ret = ipv4_sysctl_doint_strategy(table, name, nlen, oldval, oldlenp, > + newval, newlen, context); > + > + if (1 != ret) > + return ret; > + do_ignore_tos(); > + return 0; > +#else > + return (newval || newlen) ? -EINVAL : 0; > +#endif > +} > + > ctl_table ipv4_route_table[] = { > { > .ctl_name = NET_IPV4_ROUTE_FLUSH, > @@ -2660,6 +2743,23 @@ > .mode = 0644, > .proc_handler = &proc_dointvec_jiffies, > .strategy = &sysctl_jiffies, > + }, > + { > + .ctl_name = NET_IPV4_ROUTE_IGNORE_TOS, > + .procname = "ignore_tos", > + .data = &ip_rt_ignore_tos, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = &ip_rt_ignore_tos_handler, > + .strategy = &ipv4_sysctl_ignore_tos_strategy, > + }, > + { > + .ctl_name = NET_IPV4_ROUTE_PMTU_MODE, > + .procname = "pmtu_mode", > + .data = &ip_rt_pmtu_mode, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = &proc_dointvec, > }, > { .ctl_name = 0 } > }; > > Regards > > -- > Julian Anastasov <ja@ssi.bg> > -- Kevin W. Rudd Linux Change Team IBM Global Services 1-800-426-7378, T/L 775-4161 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-12 23:38 ` Julian Anastasov 2003-12-18 23:17 ` Kevin W. Rudd @ 2004-01-19 22:43 ` Kevin W. Rudd 2004-01-20 4:29 ` David S. Miller 1 sibling, 1 reply; 31+ messages in thread From: Kevin W. Rudd @ 2004-01-19 22:43 UTC (permalink / raw) To: netdev Does anyone know the status of this potential patch? There has been silence since the initial patch proposal was made over a month ago. Thanks, -Kevin -- Kevin W. Rudd Linux Change Team IBM Global Services 1-800-426-7378, T/L 775-4161 On Sat, 13 Dec 2003, Julian Anastasov wrote: > Date: Sat, 13 Dec 2003 01:38:00 +0200 (EET) > From: Julian Anastasov <ja@ssi.bg> > To: David S. Miller <davem@redhat.com> > Cc: niv@us.ibm.com, ak@suse.de, ruddk@us.ibm.com, kuznet@ms2.inr.ac.ru, > netdev@oss.sgi.com, chester.f.johnson@intel.com > Subject: Re: PMTU issues due to TOS field manipulation (for DSCP) > > > Hello, > > On Fri, 12 Dec 2003, David S. Miller wrote: > > > 1) PMTU processing applies PMTU change to all TOS'd instances of > > a route. This behavior change is sysctl controllable, and > > on by default. > > > > The implementation is to just lookup all 8 possible TOS values. > > > > 2) Whether TOS is a routing cache hash key is controlled by another > > sysctl. > > > > When CONFIG_IP_ROUTE_TOS is set this sysctl defaults to on, other- > > wise it defaults to off. > > > > I think #2 should be very safe because fib node fn_tos values are only > > ever set when that config variable is enabled, and fib rule r_tos values > > are only compared on lookup when it is enabled as well. However, there > > could be a few more ifdefs added to the fib rule code to cover all the > > assignment cases too but let's not worry about that right now. > > OK, here is a change that compiles. No fib changes yet (it > is the next step). I'm not sure what var names are better. > > > Comments? > > What about ip_rt_redirect? May be we need the same TOS > matching behavior if TOS is changed somewhere after routing? > > diff -Nru a/include/linux/sysctl.h b/include/linux/sysctl.h > --- a/include/linux/sysctl.h Sat Dec 13 01:16:15 2003 > +++ b/include/linux/sysctl.h Sat Dec 13 01:16:15 2003 > @@ -329,6 +329,8 @@ > NET_IPV4_ROUTE_MIN_PMTU=16, > NET_IPV4_ROUTE_MIN_ADVMSS=17, > NET_IPV4_ROUTE_SECRET_INTERVAL=18, > + NET_IPV4_ROUTE_IGNORE_TOS=19, > + NET_IPV4_ROUTE_PMTU_MODE=20, > }; > > enum > diff -Nru a/net/ipv4/route.c b/net/ipv4/route.c > --- a/net/ipv4/route.c Sat Dec 13 01:16:15 2003 > +++ b/net/ipv4/route.c Sat Dec 13 01:16:15 2003 > @@ -124,6 +124,14 @@ > int ip_rt_min_advmss = 256; > int ip_rt_secret_interval = 10 * 60 * HZ; > static unsigned long rt_deadline; > +#ifdef CONFIG_IP_ROUTE_TOS > +static u8 iptos_rt_mask = IPTOS_RT_MASK; > +int ip_rt_ignore_tos; > +#else > +static u8 iptos_rt_mask; > +int ip_rt_ignore_tos = 1; > +#endif > +int ip_rt_pmtu_mode; /* 1=match by iph->tos, 0=ignore TOS for PMTU */ > > #define RTprint(a...) printk(KERN_DEBUG a) > > @@ -967,13 +975,12 @@ > void ip_rt_redirect(u32 old_gw, u32 daddr, u32 new_gw, > u32 saddr, u8 tos, struct net_device *dev) > { > - int i, k; > + int i; > struct in_device *in_dev = in_dev_get(dev); > struct rtable *rth, **rthp; > u32 skeys[2] = { saddr, 0 }; > - int ikeys[2] = { dev->ifindex, 0 }; > > - tos &= IPTOS_RT_MASK; > + tos &= iptos_rt_mask; > > if (!in_dev) > return; > @@ -993,10 +1000,7 @@ > } > > for (i = 0; i < 2; i++) { > - for (k = 0; k < 2; k++) { > - unsigned hash = rt_hash_code(daddr, > - skeys[i] ^ (ikeys[k] << 5), > - tos); > + unsigned hash = rt_hash_code(daddr, skeys[i], tos); > > rthp=&rt_hash_table[hash].chain; > > @@ -1008,7 +1012,9 @@ > if (rth->fl.fl4_dst != daddr || > rth->fl.fl4_src != skeys[i] || > rth->fl.fl4_tos != tos || > - rth->fl.oif != ikeys[k] || > + (rth->fl.oif && > + rth->fl.oif != dev->ifindex) || > + rth->rt_dst == rth->rt_gateway || > rth->fl.iif != 0) { > rthp = &rth->u.rt_next; > continue; > @@ -1075,7 +1081,6 @@ > rcu_read_unlock(); > do_next: > ; > - } > } > in_dev_put(in_dev); > return; > @@ -1105,8 +1110,7 @@ > } else if ((rt->rt_flags & RTCF_REDIRECTED) || > rt->u.dst.expires) { > unsigned hash = rt_hash_code(rt->fl.fl4_dst, > - rt->fl.fl4_src ^ > - (rt->fl.oif << 5), > + rt->fl.fl4_src, > rt->fl.fl4_tos); > #if RT_CACHE_DEBUG >= 1 > printk(KERN_DEBUG "ip_rt_advice: redirect to " > @@ -1237,21 +1241,33 @@ > return 68; > } > > +/* See IPTOS_RT_MASK */ > +static u8 all_tos_values[8] = { 0x00, 0x04, 0x08, 0x0C, 0x10, 0x14, 0x18, 0x1C }; > + > unsigned short ip_rt_frag_needed(struct iphdr *iph, unsigned short new_mtu) > { > - int i; > + int i, j, ntos; > unsigned short old_mtu = ntohs(iph->tot_len); > struct rtable *rth; > u32 skeys[2] = { iph->saddr, 0, }; > u32 daddr = iph->daddr; > - u8 tos = iph->tos & IPTOS_RT_MASK; > + u8 *tos_values, tos = iph->tos & iptos_rt_mask; > unsigned short est_mtu = 0; > > if (ipv4_config.no_pmtu_disc) > return 0; > > + if (ip_rt_pmtu_mode || !iptos_rt_mask) { > + tos_values = &tos; > + ntos = 1; > + } else { > + tos_values = all_tos_values; > + ntos = ARRAY_SIZE(all_tos_values); > + } > + > + for (j = 0; j < ntos; j++) > for (i = 0; i < 2; i++) { > - unsigned hash = rt_hash_code(daddr, skeys[i], tos); > + unsigned hash = rt_hash_code(daddr, skeys[i], tos_values[j]); > > rcu_read_lock(); > for (rth = rt_hash_table[hash].chain; rth; > @@ -1259,9 +1275,9 @@ > smp_read_barrier_depends(); > if (rth->fl.fl4_dst == daddr && > rth->fl.fl4_src == skeys[i] && > + rth->fl.fl4_tos == tos_values[j] && > rth->rt_dst == daddr && > rth->rt_src == iph->saddr && > - rth->fl.fl4_tos == tos && > rth->fl.iif == 0 && > !(dst_metric_locked(&rth->u.dst, RTAX_MTU))) { > unsigned short mtu = new_mtu; > @@ -1503,7 +1519,7 @@ > RT_CACHE_STAT_INC(in_slow_mc); > > in_dev_put(in_dev); > - hash = rt_hash_code(daddr, saddr ^ (dev->ifindex << 5), tos); > + hash = rt_hash_code(daddr, saddr, tos); > return rt_intern_hash(hash, rth, (struct rtable**) &skb->dst); > > e_nobufs: > @@ -1554,7 +1570,7 @@ > if (!in_dev) > goto out; > > - hash = rt_hash_code(daddr, saddr ^ (fl.iif << 5), tos); > + hash = rt_hash_code(daddr, saddr, tos); > > /* Check for the most weird martians, which can be not detected > by fib_lookup. > @@ -1846,8 +1862,8 @@ > unsigned hash; > int iif = dev->ifindex; > > - tos &= IPTOS_RT_MASK; > - hash = rt_hash_code(daddr, saddr ^ (iif << 5), tos); > + tos &= iptos_rt_mask; > + hash = rt_hash_code(daddr, saddr, tos); > > rcu_read_lock(); > for (rth = rt_hash_table[hash].chain; rth; rth = rth->u.rt_next) { > @@ -1912,11 +1928,11 @@ > > int ip_route_output_slow(struct rtable **rp, const struct flowi *oldflp) > { > - u32 tos = oldflp->fl4_tos & (IPTOS_RT_MASK | RTO_ONLINK); > + u32 tos = oldflp->fl4_tos & (iptos_rt_mask | RTO_ONLINK); > struct flowi fl = { .nl_u = { .ip4_u = > { .daddr = oldflp->fl4_dst, > .saddr = oldflp->fl4_src, > - .tos = tos & IPTOS_RT_MASK, > + .tos = tos & iptos_rt_mask, > .scope = ((tos & RTO_ONLINK) ? > RT_SCOPE_LINK : > RT_SCOPE_UNIVERSE), > @@ -2190,7 +2206,7 @@ > > rth->rt_flags = flags; > > - hash = rt_hash_code(oldflp->fl4_dst, oldflp->fl4_src ^ (oldflp->oif << 5), tos); > + hash = rt_hash_code(oldflp->fl4_dst, oldflp->fl4_src, tos); > err = rt_intern_hash(hash, rth, rp); > done: > if (free_res) > @@ -2213,8 +2229,9 @@ > { > unsigned hash; > struct rtable *rth; > + u8 tos = flp->fl4_tos & (iptos_rt_mask | RTO_ONLINK); > > - hash = rt_hash_code(flp->fl4_dst, flp->fl4_src ^ (flp->oif << 5), flp->fl4_tos); > + hash = rt_hash_code(flp->fl4_dst, flp->fl4_src, tos); > > rcu_read_lock(); > for (rth = rt_hash_table[hash].chain; rth; rth = rth->u.rt_next) { > @@ -2226,8 +2243,7 @@ > #ifdef CONFIG_IP_ROUTE_FWMARK > rth->fl.fl4_fwmark == flp->fl4_fwmark && > #endif > - !((rth->fl.fl4_tos ^ flp->fl4_tos) & > - (IPTOS_RT_MASK | RTO_ONLINK))) { > + rth->fl.fl4_tos == tos) { > rth->u.dst.lastuse = jiffies; > dst_hold(&rth->u.dst); > rth->u.dst.__use++; > @@ -2479,6 +2495,26 @@ > } > > #ifdef CONFIG_SYSCTL > + > +static int ipv4_sysctl_doint_strategy(ctl_table *table, int *name, > + int nlen, void *oldval, > + size_t *oldlenp, void *newval, > + size_t newlen, void **context) > +{ > + int val; > + > + if (!newval || !newlen) > + return 0; > + if (newlen != sizeof(int)) > + return -EINVAL; > + if (get_user(val, (int *)newval)) > + return -EFAULT; > + if (val == *(int *) table->data) > + return 0; > + *(int *) table->data = val; > + return 1; > +} > + > static int flush_delay; > > static int ipv4_sysctl_rtcache_flush(ctl_table *ctl, int write, > @@ -2508,6 +2544,53 @@ > return 0; > } > > +#ifdef CONFIG_IP_ROUTE_TOS > + > +static void do_ignore_tos(void) > +{ > + iptos_rt_mask = ip_rt_ignore_tos? 0 : IPTOS_RT_MASK; > + rt_cache_flush(0); > +} > + > +#endif > + > +static int ip_rt_ignore_tos_handler(ctl_table *ctl, int write, > + struct file *filp, void *buffer, > + size_t *lenp) > +{ > + if (write) { > +#ifdef CONFIG_IP_ROUTE_TOS > + int old = ip_rt_ignore_tos; > + int ret = proc_dointvec(ctl, write, filp, buffer, lenp); > + > + if (ret) > + return ret; > + if (old != ip_rt_ignore_tos) do_ignore_tos(); > + return 0; > +#else > + return -EINVAL; > +#endif > + } > + return proc_dointvec(ctl, write, filp, buffer, lenp); > +} > +static int ipv4_sysctl_ignore_tos_strategy(ctl_table *table, int *name, > + int nlen, void *oldval, > + size_t *oldlenp, void *newval, > + size_t newlen, void **context) > +{ > +#ifdef CONFIG_IP_ROUTE_TOS > + int ret = ipv4_sysctl_doint_strategy(table, name, nlen, oldval, oldlenp, > + newval, newlen, context); > + > + if (1 != ret) > + return ret; > + do_ignore_tos(); > + return 0; > +#else > + return (newval || newlen) ? -EINVAL : 0; > +#endif > +} > + > ctl_table ipv4_route_table[] = { > { > .ctl_name = NET_IPV4_ROUTE_FLUSH, > @@ -2660,6 +2743,23 @@ > .mode = 0644, > .proc_handler = &proc_dointvec_jiffies, > .strategy = &sysctl_jiffies, > + }, > + { > + .ctl_name = NET_IPV4_ROUTE_IGNORE_TOS, > + .procname = "ignore_tos", > + .data = &ip_rt_ignore_tos, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = &ip_rt_ignore_tos_handler, > + .strategy = &ipv4_sysctl_ignore_tos_strategy, > + }, > + { > + .ctl_name = NET_IPV4_ROUTE_PMTU_MODE, > + .procname = "pmtu_mode", > + .data = &ip_rt_pmtu_mode, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = &proc_dointvec, > }, > { .ctl_name = 0 } > }; > > Regards > > -- > Julian Anastasov <ja@ssi.bg> > > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2004-01-19 22:43 ` Kevin W. Rudd @ 2004-01-20 4:29 ` David S. Miller 0 siblings, 0 replies; 31+ messages in thread From: David S. Miller @ 2004-01-20 4:29 UTC (permalink / raw) To: Kevin W. Rudd; +Cc: netdev On Mon, 19 Jan 2004 14:43:50 -0800 (PST) "Kevin W. Rudd" <ruddk@us.ibm.com> wrote: > Does anyone know the status of this potential patch? There has been > silence since the initial patch proposal was made over a month ago. It's still in my backlog, I'll get to it as time permits, I can promise nothing else. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-12 8:31 ` David S. Miller 2003-12-12 23:38 ` Julian Anastasov @ 2003-12-13 0:10 ` Julian Anastasov 2004-03-04 9:36 ` David S. Miller 1 sibling, 1 reply; 31+ messages in thread From: Julian Anastasov @ 2003-12-13 0:10 UTC (permalink / raw) To: David S. Miller; +Cc: niv, ak, ruddk, kuznet, netdev, chester.f.johnson Hello, On Fri, 12 Dec 2003, David S. Miller wrote: > I think #2 should be very safe because fib node fn_tos values are only > ever set when that config variable is enabled, and fib rule r_tos values > are only compared on lookup when it is enabled as well. However, there > could be a few more ifdefs added to the fib rule code to cover all the > assignment cases too but let's not worry about that right now. It seems the FIB changes are small, only for the lookups. I didn't changed the other places, better to allow deletion by tos because the flag can be changed at run time. diff -Nru a/net/ipv4/fib_hash.c b/net/ipv4/fib_hash.c --- a/net/ipv4/fib_hash.c Sat Dec 13 02:04:59 2003 +++ b/net/ipv4/fib_hash.c Sat Dec 13 02:04:59 2003 @@ -48,6 +48,8 @@ printk(KERN_DEBUG a) */ +extern int ip_rt_ignore_tos; + static kmem_cache_t * fn_hash_kmem; /* @@ -309,7 +311,7 @@ continue; } #ifdef CONFIG_IP_ROUTE_TOS - if (f->fn_tos && f->fn_tos != flp->fl4_tos) + if (f->fn_tos && f->fn_tos != flp->fl4_tos && !ip_rt_ignore_tos) continue; #endif f->fn_state |= FN_S_ACCESSED; diff -Nru a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c --- a/net/ipv4/fib_rules.c Sat Dec 13 02:04:59 2003 +++ b/net/ipv4/fib_rules.c Sat Dec 13 02:04:59 2003 @@ -49,6 +49,8 @@ #define FRprintk(a...) +extern int ip_rt_ignore_tos; + struct fib_rule { struct fib_rule *r_next; @@ -323,7 +325,7 @@ if (((saddr^r->r_src) & r->r_srcmask) || ((daddr^r->r_dst) & r->r_dstmask) || #ifdef CONFIG_IP_ROUTE_TOS - (r->r_tos && r->r_tos != flp->fl4_tos) || + (r->r_tos && r->r_tos != flp->fl4_tos && !ip_rt_ignore_tos) || #endif #ifdef CONFIG_IP_ROUTE_FWMARK (r->r_fwmark && r->r_fwmark != flp->fl4_fwmark) || Regards -- Julian Anastasov <ja@ssi.bg> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-13 0:10 ` Julian Anastasov @ 2004-03-04 9:36 ` David S. Miller 2004-03-04 20:56 ` Julian Anastasov 0 siblings, 1 reply; 31+ messages in thread From: David S. Miller @ 2004-03-04 9:36 UTC (permalink / raw) To: Julian Anastasov; +Cc: niv, ak, ruddk, kuznet, netdev, chester.f.johnson Julian, I'd like to bring this back to life. Can you post a new version of the ip_rt_ignore_tos patch against current 2.6.x? We'll discuss and integrate just like for the ARP stuff. Sorry for sitting on this for so long, been busy... ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2004-03-04 9:36 ` David S. Miller @ 2004-03-04 20:56 ` Julian Anastasov 2004-03-04 22:02 ` kuznet 0 siblings, 1 reply; 31+ messages in thread From: Julian Anastasov @ 2004-03-04 20:56 UTC (permalink / raw) To: David S. Miller; +Cc: niv, ak, ruddk, kuznet, netdev, chester.f.johnson Hello, On Thu, 4 Mar 2004, David S. Miller wrote: > Julian, I'd like to bring this back to life. Can you post a new version > of the ip_rt_ignore_tos patch against current 2.6.x? Done. Essentially, this patch combines the last posted two diff files: http://www.ssi.bg/~ja/tmp/tos-6.diff > We'll discuss and integrate just like for the ARP stuff. > > Sorry for sitting on this for so long, been busy... np, the remaining questions are: - do we need to walk all tos values for ip_rt_redirect in the same way as for ip_rt_frag_needed, if yes, how to name this flag? icmp_redirect_tos_mode? match_tos_on_redirect? May be pmtu_mode can be renamed to match_tos_on_pmtu? - from another thread: whether ICMP redirects modify only routes via gateway when shared_media is ON: http://marc.theaimsgroup.com/?l=linux-netdev&m=107109827516060&w=2 At the final I'll change some doc files to explain the new flags. Regards -- Julian Anastasov <ja@ssi.bg> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2004-03-04 20:56 ` Julian Anastasov @ 2004-03-04 22:02 ` kuznet 2004-03-06 11:55 ` Julian Anastasov 2004-03-06 16:02 ` Julian Anastasov 0 siblings, 2 replies; 31+ messages in thread From: kuznet @ 2004-03-04 22:02 UTC (permalink / raw) To: Julian Anastasov; +Cc: davem, niv, ak, ruddk, netdev, chester.f.johnson Hello! > - do we need to walk all tos values for ip_rt_redirect in the same > way as for ip_rt_frag_needed, Well, it is just the same thing (except for one thing, that ignored redirects are harmless) > - from another thread: whether ICMP redirects modify only > routes via gateway when shared_media is ON: > > http://marc.theaimsgroup.com/?l=linux-netdev&m=107109827516060&w=2 "message but we are sure we hit the target IP directly" You cannot be sure, actually. This happens and resolves the situation when the things sort ip route add default dev eth0 are used i.e. host does not know real prefixes. If this is a security issue (I do not see actually, the things on link can be screwed via proxy arp et all in any case), make it a separate option or even better use IN_DEV_SEC_REDIRECTS(in_dev) like similar paranoid case for !shared_media case. Alexey ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2004-03-04 22:02 ` kuznet @ 2004-03-06 11:55 ` Julian Anastasov 2004-03-06 16:02 ` Julian Anastasov 1 sibling, 0 replies; 31+ messages in thread From: Julian Anastasov @ 2004-03-06 11:55 UTC (permalink / raw) To: kuznet; +Cc: davem, niv, ak, ruddk, netdev, chester.f.johnson Hello, On Fri, 5 Mar 2004 kuznet@ms2.inr.ac.ru wrote: > > routes via gateway when shared_media is ON: > > > > http://marc.theaimsgroup.com/?l=linux-netdev&m=107109827516060&w=2 > > "message but we are sure we hit the target IP directly" > > You cannot be sure, actually. This happens and resolves the situation > when the things sort ip route add default dev eth0 are used i.e. host > does not know real prefixes. > > If this is a security issue (I do not see actually, the things on link > can be screwed via proxy arp et all in any case), make it a separate option > or even better use IN_DEV_SEC_REDIRECTS(in_dev) like similar paranoid case > for !shared_media case. I now see, may be better to stay as before, IN_DEV_SEC_REDIRECTS if used, can break the shared_media feature. Anyways, I prepared a final version: http://www.ssi.bg/~ja/tmp/tos-8.diff It passes simple tests. I hope it is ready for inclusion after eventual tuning. Compared to previous versions I removed the 'rth->rt_dst == rth->rt_gateway' check for redirects and renamed the flags. Regards -- Julian Anastasov <ja@ssi.bg> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2004-03-04 22:02 ` kuznet 2004-03-06 11:55 ` Julian Anastasov @ 2004-03-06 16:02 ` Julian Anastasov 1 sibling, 0 replies; 31+ messages in thread From: Julian Anastasov @ 2004-03-06 16:02 UTC (permalink / raw) To: kuznet; +Cc: davem, niv, ak, ruddk, netdev, chester.f.johnson Hello, On Fri, 5 Mar 2004 kuznet@ms2.inr.ac.ru wrote: > > - do we need to walk all tos values for ip_rt_redirect in the same > > way as for ip_rt_frag_needed, > > Well, it is just the same thing (except for one thing, that ignored > redirects are harmless) A related 3-hunk change (on top of tos-8.diff), more eyes are needed. The 2nd hunk tries harder to propagate the new_gw. It seems this 2nd hunk is needed because the first list of checks does not compare all possible keys and we can not simply exit the hash chain in the middle. This problem exists also in current kernels, for example, when fwmark is used. As for hunks 1 and the 3, they are needed if we want to propagate the new_gw value to all entries in the hash chain with the cost of some cycles. If we do not apply 1 and 3 then we will update the next entries in the chain on another redirect message. do_next is still used to avoid deadloops in a chain, do_repeat is used only on successful update. Comments? Regards -- Julian Anastasov <ja@ssi.bg> --- linux-tos-8/net/ipv4/route.c 2004-03-06 13:24:03.000000000 +0200 +++ linux/net/ipv4/route.c 2004-03-06 17:28:34.572673040 +0200 @@ -1018,6 +1018,8 @@ for (i = 0; i < 2; i++) { unsigned hash = rt_hash_code(daddr, skeys[i], tos_values[j]); + do_repeat: + rthp=&rt_hash_table[hash].chain; rcu_read_lock(); @@ -1039,8 +1041,10 @@ rth->rt_src != saddr || rth->u.dst.error || rth->rt_gateway != old_gw || - rth->u.dst.dev != dev) - break; + rth->u.dst.dev != dev) { + rthp = &rth->u.rt_next; + continue; + } dst_hold(&rth->u.dst); rcu_read_unlock(); @@ -1089,8 +1093,10 @@ } rt_del(hash, rth); - if (!rt_intern_hash(hash, rt, &rt)) + if (!rt_intern_hash(hash, rt, &rt)) { ip_rt_put(rt); + goto do_repeat; + } goto do_next; } rcu_read_unlock(); ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-10 19:34 ` Andi Kleen 2003-12-10 20:20 ` Julian Anastasov @ 2003-12-10 20:26 ` Kevin W. Rudd 2003-12-10 20:52 ` Andi Kleen 2003-12-10 20:30 ` Nivedita Singhvi 2 siblings, 1 reply; 31+ messages in thread From: Kevin W. Rudd @ 2003-12-10 20:26 UTC (permalink / raw) To: Andi Kleen; +Cc: davem, kuznet, netdev, chester.f.johnson I don't think it's practical to make these types of changes at the router level. Discounting not having the ability to change the IOS behavior, it is quite probable that the router doing the marking is beyond your control (e.g. at the ISP level). The only network users that would have to make this change would be those taking advantage of a network using QoS marking (and a smaller MTU somewhere in the path). That's why the ideal short term solution would be to allow the manipulation of this behavior via a sysctl option/variable, but keep the default the same as it is now. -Kevin On Wed, 10 Dec 2003, Andi Kleen wrote: > Subject: Re: PMTU issues due to TOS field manipulation (for DSCP) > > On Wed, 10 Dec 2003 10:23:15 -0800 (PST) > "Kevin W. Rudd" <ruddk@us.ibm.com> wrote: > > > > > Thoughts? Comments? > > I don't think the network users will be very happy if you require changing all > end hosts this way. How about the following hack? For DF=1 packets > the ipid field is useless. When you rewrite the TOS to DSCP save > the old TOS in the ipid field. When you see an ICMP fragment required > message with the right DSCP on the router restore the old TOS from the ipid > field. > > One drawback is that there are a few VJ header compression servers that > require changing ipid. But you can still handle these by not overwriting > all bits of ipid so that it still changes (keep a few changing bits) > > -Andi > -- Kevin W. Rudd Linux Change Team IBM Global Services 1-800-426-7378, T/L 775-4161 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-10 20:26 ` Kevin W. Rudd @ 2003-12-10 20:52 ` Andi Kleen 0 siblings, 0 replies; 31+ messages in thread From: Andi Kleen @ 2003-12-10 20:52 UTC (permalink / raw) To: Kevin W. Rudd; +Cc: davem, kuznet, netdev, chester.f.johnson On Wed, 10 Dec 2003 12:26:26 -0800 (PST) "Kevin W. Rudd" <ruddk@us.ibm.com> wrote: > I don't think it's practical to make these types of changes at the > router level. Just changing all end hosts for this will be likely even less practical because there are much more of them than routers. -Andi ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-10 19:34 ` Andi Kleen 2003-12-10 20:20 ` Julian Anastasov 2003-12-10 20:26 ` Kevin W. Rudd @ 2003-12-10 20:30 ` Nivedita Singhvi 2003-12-10 20:55 ` Andi Kleen 2 siblings, 1 reply; 31+ messages in thread From: Nivedita Singhvi @ 2003-12-10 20:30 UTC (permalink / raw) To: Andi Kleen; +Cc: Kevin W. Rudd, davem, kuznet, netdev, chester.f.johnson Andi Kleen wrote: > I don't think the network users will be very happy if you require changing all > end hosts this way. How about the following hack? For DF=1 packets > the ipid field is useless. When you rewrite the TOS to DSCP save > the old TOS in the ipid field. When you see an ICMP fragment required > message with the right DSCP on the router restore the old TOS from the ipid > field. Wouldnt this require changes at the both ends, router and host? How would we sync? thanks, Nivedita ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-10 20:30 ` Nivedita Singhvi @ 2003-12-10 20:55 ` Andi Kleen 2003-12-10 21:11 ` Nivedita Singhvi 0 siblings, 1 reply; 31+ messages in thread From: Andi Kleen @ 2003-12-10 20:55 UTC (permalink / raw) To: Nivedita Singhvi; +Cc: ruddk, netdev, chester.f.johnson On Wed, 10 Dec 2003 12:30:39 -0800 Nivedita Singhvi <niv@us.ibm.com> wrote: > > I don't think the network users will be very happy if you require changing all > > end hosts this way. How about the following hack? For DF=1 packets > > the ipid field is useless. When you rewrite the TOS to DSCP save > > the old TOS in the ipid field. When you see an ICMP fragment required > > message with the right DSCP on the router restore the old TOS from the ipid > > field. > > Wouldnt this require changes at the both ends, router and > host? How would we sync? Only the router would need to change (and rewrite ICMP messages, which is a bit nasty, but then compatibility is not always fun) -Andi P.S.: I am not opposed to fixing linux for this, just I have my doubts that fixing all end hosts is a practical solution for the problem. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-10 20:55 ` Andi Kleen @ 2003-12-10 21:11 ` Nivedita Singhvi 0 siblings, 0 replies; 31+ messages in thread From: Nivedita Singhvi @ 2003-12-10 21:11 UTC (permalink / raw) To: Andi Kleen; +Cc: ruddk, netdev, chester.f.johnson Andi Kleen wrote: > Only the router would need to change (and rewrite ICMP messages, which is a > bit nasty, but then compatibility is not always fun) True, but I get a sinking feeling when I hear we have to change router/icmp code. The word "only" seems inappropriate ;). > -Andi > > P.S.: I am not opposed to fixing linux for this, just I have my doubts > that fixing all end hosts is a practical solution for the problem. The reason it seems simpler to me, at least, and I don't speak for Kevin or the person at Intel who first reported the problem, is that I see only a few people (well, only Intel :)) running into the problem. So if there were a patch available or the fix were in some release, only the few end hosts who actually needed it would need to upgrade. Otherwise, rollout would be evolutionary. But also open to other approaches... thanks, Nivedita ^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: PMTU issues due to TOS field manipulation (for DSCP) @ 2003-12-10 21:35 Johnson, Chester F 0 siblings, 0 replies; 31+ messages in thread From: Johnson, Chester F @ 2003-12-10 21:35 UTC (permalink / raw) To: Andi Kleen, Nivedita Singhvi; +Cc: ruddk, netdev >From a corporate network perspective there are a couple of options that can temporarily work around the problem. As an example, support for jumbo frames on Ethernet IP-Sec interfaces. This would require some hardware replacement, but GigE generally supports this. On the other hand, I'm not sure I want to comprehend IP-Sec at GigE line rates. We are pretty early in deploying diff-serv compliant QoS at this scale. We are also one of few companies that mandate encrypted private WAN links (Only the paranoid survive). As we look forward, we would expect more frequent appearance of this behavior at other companies. All that said, an evolutionary roll-out is OK. Knowing that there is a target release that would resolve the issue gives us something to look forward to. It may also be a good idea to submit some clarification in the relevant RFCs. chet -----Original Message----- From: Andi Kleen [mailto:ak@suse.de] Sent: Wednesday, December 10, 2003 12:55 PM To: Nivedita Singhvi Cc: ruddk@us.ibm.com; netdev@oss.sgi.com; Johnson, Chester F Subject: Re: PMTU issues due to TOS field manipulation (for DSCP) On Wed, 10 Dec 2003 12:30:39 -0800 Nivedita Singhvi <niv@us.ibm.com> wrote: > > I don't think the network users will be very happy if you require > > changing all end hosts this way. How about the following hack? For > > DF=1 packets the ipid field is useless. When you rewrite the TOS to > > DSCP save the old TOS in the ipid field. When you see an ICMP > > fragment required message with the right DSCP on the router restore > > the old TOS from the ipid field. > > Wouldnt this require changes at the both ends, router and host? How > would we sync? Only the router would need to change (and rewrite ICMP messages, which is a bit nasty, but then compatibility is not always fun) -Andi P.S.: I am not opposed to fixing linux for this, just I have my doubts that fixing all end hosts is a practical solution for the problem. ^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: PMTU issues due to TOS field manipulation (for DSCP)
@ 2003-12-10 23:36 Johnson, Chester F
2003-12-11 0:17 ` Julian Anastasov
0 siblings, 1 reply; 31+ messages in thread
From: Johnson, Chester F @ 2003-12-10 23:36 UTC (permalink / raw)
To: David S. Miller, Julian Anastasov; +Cc: niv, ak, ruddk, kuznet, netdev
I'm trying to think through one potential issue with ignoring the TOS
field. Would/should a route redirect have to match the same src-dst-tos
hash?
1. Assume the network uses DSCP to make routing decisions.
2. Assume the first hop routers are configured as passive and do not
advertise routes to the supported subnets and presumes static routes
configured on end hosts.
3. The Linux host has multiple network interfaces for route diversity.
The preferred route may send the frame to the wrong router and then
ignore the icmp route redirect. Or would it?
chet
-----Original Message-----
From: David S. Miller [mailto:davem@redhat.com]
Sent: Wednesday, December 10, 2003 3:21 PM
To: Julian Anastasov
Cc: niv@us.ibm.com; ak@suse.de; ruddk@us.ibm.com; kuznet@ms2.inr.ac.ru;
netdev@oss.sgi.com; Johnson, Chester F
Subject: Re: PMTU issues due to TOS field manipulation (for DSCP)
On Thu, 11 Dec 2003 01:15:06 +0200 (EET)
Julian Anastasov <ja@ssi.bg> wrote:
> It seems, there are no many users of OIF!=0 but if TOS is used as
> routing key we can see up to 8 entries with different TOS for same
> SADDR,DADDR. Of course, it looks difficult to walk 8 rows just to
> check all TOS variants, the common case is to see only one TOS value
> used. That is why I propose to eliminate the TOS as hash key and to
> walk one row. At first look, the risk of DoS is same, thanks to the
> random value.
This is not my understanding at all.
Consider the case where we generate routing cache entries for all 8 TOS
values. Currently we'll likely get a O(1) lookup for any one of those
entries.
Your proposal guarentees that all such entries will land to the same
hash chain, since TOS is not an input for the hash any longer. Therefore
the lookup in my example case will be O(8).
And instead of just eating the complexity at ICMP PMTU handling time, we
eat the complexity at every routing cache lookup.
I really don't think we can consider this.
^ permalink raw reply [flat|nested] 31+ messages in thread* RE: PMTU issues due to TOS field manipulation (for DSCP) 2003-12-10 23:36 Johnson, Chester F @ 2003-12-11 0:17 ` Julian Anastasov 0 siblings, 0 replies; 31+ messages in thread From: Julian Anastasov @ 2003-12-11 0:17 UTC (permalink / raw) To: Johnson, Chester F; +Cc: David S. Miller, niv, ak, ruddk, kuznet, netdev Hello, On Wed, 10 Dec 2003, Johnson, Chester F wrote: > I'm trying to think through one potential issue with ignoring the TOS > field. Would/should a route redirect have to match the same src-dst-tos > hash? It will surely match as before, as long as you do not change the sysctl var to ignore the TOS. > 1. Assume the network uses DSCP to make routing decisions. > 2. Assume the first hop routers are configured as passive and do not > advertise routes to the supported subnets and presumes static routes > configured on end hosts. > 3. The Linux host has multiple network interfaces for route diversity. > > The preferred route may send the frame to the wrong router and then > ignore the icmp route redirect. Or would it? That is why I prefer 3 states, we will be able to route by TOS (it will not be ignored for routing and cache) but still to ignore TOS for PMTUD. I think, we need one proc entry that will alter 2 int variables: whether TOS is ignored for PMTU and another for the routing TOS mask. As we see, if we ignore TOS and OIF for PMTU, all possible paths (including different TOS and OIF values) between SADDR and DADDR will have same PMTU and we can not solve this problem, we have to update PMTU for all these entries. Regards -- Julian Anastasov <ja@ssi.bg> ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2004-03-06 16:02 UTC | newest] Thread overview: 31+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2003-12-10 18:23 PMTU issues due to TOS field manipulation (for DSCP) Kevin W. Rudd 2003-12-10 19:34 ` Andi Kleen 2003-12-10 20:20 ` Julian Anastasov 2003-12-10 20:33 ` Nivedita Singhvi 2003-12-10 21:18 ` Julian Anastasov 2003-12-10 22:36 ` Nivedita Singhvi 2003-12-10 22:51 ` David S. Miller 2003-12-10 23:15 ` Julian Anastasov 2003-12-10 23:20 ` David S. Miller 2003-12-11 0:06 ` Julian Anastasov 2003-12-11 0:09 ` David S. Miller 2003-12-11 0:34 ` Julian Anastasov 2003-12-12 8:31 ` David S. Miller 2003-12-12 23:38 ` Julian Anastasov 2003-12-18 23:17 ` Kevin W. Rudd 2004-01-19 22:43 ` Kevin W. Rudd 2004-01-20 4:29 ` David S. Miller 2003-12-13 0:10 ` Julian Anastasov 2004-03-04 9:36 ` David S. Miller 2004-03-04 20:56 ` Julian Anastasov 2004-03-04 22:02 ` kuznet 2004-03-06 11:55 ` Julian Anastasov 2004-03-06 16:02 ` Julian Anastasov 2003-12-10 20:26 ` Kevin W. Rudd 2003-12-10 20:52 ` Andi Kleen 2003-12-10 20:30 ` Nivedita Singhvi 2003-12-10 20:55 ` Andi Kleen 2003-12-10 21:11 ` Nivedita Singhvi -- strict thread matches above, loose matches on Subject: below -- 2003-12-10 21:35 Johnson, Chester F 2003-12-10 23:36 Johnson, Chester F 2003-12-11 0:17 ` Julian Anastasov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).