netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [ROUTE] PMTU only works on half the time
@ 2003-12-01 20:16 Herbert Xu
  2003-12-01 20:47 ` Herbert Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Herbert Xu @ 2003-12-01 20:16 UTC (permalink / raw)
  To: David S. Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 384 bytes --]

Hi Dave:

I found out that PMTU only works on those routing cache entries where
rt_src != 0.  This patch should make it work for all matching entries.

Cheers,
-- 
Debian GNU/Linux 3.0 is out! ( http://www.debian.org/ )
Email:  Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[-- Attachment #2: p --]
[-- Type: text/plain, Size: 806 bytes --]

Index: kernel-source-2.5/net/ipv4/route.c
===================================================================
RCS file: /home/gondolin/herbert/src/CVS/debian/kernel-source-2.5/net/ipv4/route.c,v
retrieving revision 1.3
diff -u -r1.3 route.c
--- kernel-source-2.5/net/ipv4/route.c	24 Nov 2003 09:52:04 -0000	1.3
+++ kernel-source-2.5/net/ipv4/route.c	1 Dec 2003 20:15:40 -0000
@@ -1259,9 +1259,9 @@
 		     rth = rth->u.rt_next) {
 			smp_read_barrier_depends();
 			if (rth->fl.fl4_dst == daddr &&
-			    rth->fl.fl4_src == skeys[i] &&
+			    (rth->fl.fl4_src == iph->saddr ||
+			     rth->rt_src == iph->saddr) &&
 			    rth->rt_dst  == daddr &&
-			    rth->rt_src  == iph->saddr &&
 			    rth->fl.fl4_tos == tos &&
 			    rth->fl.iif == 0 &&
 			    !(dst_metric_locked(&rth->u.dst, RTAX_MTU))) {

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-01 20:16 [ROUTE] PMTU only works on half the time Herbert Xu
@ 2003-12-01 20:47 ` Herbert Xu
  2003-12-01 21:51   ` David S. Miller
  2003-12-01 23:30   ` Julian Anastasov
  0 siblings, 2 replies; 23+ messages in thread
From: Herbert Xu @ 2003-12-01 20:47 UTC (permalink / raw)
  To: David S. Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 496 bytes --]

On Tue, Dec 02, 2003 at 07:16:51AM +1100, herbert wrote:
> 
> I found out that PMTU only works on those routing cache entries where
> rt_src != 0.  This patch should make it work for all matching entries.

That patch removed one line too many.  This one should be better.
-- 
Debian GNU/Linux 3.0 is out! ( http://www.debian.org/ )
Email:  Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[-- Attachment #2: p --]
[-- Type: text/plain, Size: 774 bytes --]

Index: kernel-source-2.5/net/ipv4/route.c
===================================================================
RCS file: /home/gondolin/herbert/src/CVS/debian/kernel-source-2.5/net/ipv4/route.c,v
retrieving revision 1.3
diff -u -r1.3 route.c
--- kernel-source-2.5/net/ipv4/route.c	24 Nov 2003 09:52:04 -0000	1.3
+++ kernel-source-2.5/net/ipv4/route.c	1 Dec 2003 20:45:22 -0000
@@ -1260,8 +1260,9 @@
 			smp_read_barrier_depends();
 			if (rth->fl.fl4_dst == daddr &&
 			    rth->fl.fl4_src == skeys[i] &&
+			    (rth->fl.fl4_src == iph->saddr ||
+			     rth->rt_src == iph->saddr) &&
 			    rth->rt_dst  == daddr &&
-			    rth->rt_src  == iph->saddr &&
 			    rth->fl.fl4_tos == tos &&
 			    rth->fl.iif == 0 &&
 			    !(dst_metric_locked(&rth->u.dst, RTAX_MTU))) {

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-01 20:47 ` Herbert Xu
@ 2003-12-01 21:51   ` David S. Miller
  2003-12-01 22:05     ` Herbert Xu
  2003-12-01 23:30   ` Julian Anastasov
  1 sibling, 1 reply; 23+ messages in thread
From: David S. Miller @ 2003-12-01 21:51 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev

On Tue, 2 Dec 2003 07:47:00 +1100
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Tue, Dec 02, 2003 at 07:16:51AM +1100, herbert wrote:
> > 
> > I found out that PMTU only works on those routing cache entries where
> > rt_src != 0.  This patch should make it work for all matching entries.
> 
> That patch removed one line too many.  This one should be better.

Hmmm... 

Herbert, do you see how the outer loop and the skey[] thing works in
this PMTU handling code?  This takes care of comparing both iph->saddr
and '0' against rt->rt_src.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-01 21:51   ` David S. Miller
@ 2003-12-01 22:05     ` Herbert Xu
  2003-12-01 22:21       ` David S. Miller
  0 siblings, 1 reply; 23+ messages in thread
From: Herbert Xu @ 2003-12-01 22:05 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

On Mon, Dec 01, 2003 at 01:51:54PM -0800, David S. Miller wrote:
> 
> Herbert, do you see how the outer loop and the skey[] thing works in
> this PMTU handling code?  This takes care of comparing both iph->saddr
> and '0' against rt->rt_src.

It only takes care of fl4_src, not rt_src.

Cheers,
-- 
Debian GNU/Linux 3.0 is out! ( http://www.debian.org/ )
Email:  Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-01 22:05     ` Herbert Xu
@ 2003-12-01 22:21       ` David S. Miller
  2003-12-01 23:22         ` David S. Miller
  0 siblings, 1 reply; 23+ messages in thread
From: David S. Miller @ 2003-12-01 22:21 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev

On Tue, 2 Dec 2003 09:05:09 +1100
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Mon, Dec 01, 2003 at 01:51:54PM -0800, David S. Miller wrote:
> > 
> > Herbert, do you see how the outer loop and the skey[] thing works in
> > this PMTU handling code?  This takes care of comparing both iph->saddr
> > and '0' against rt->rt_src.
> 
> It only takes care of fl4_src, not rt_src.

Indeed.

At the surface it looks like a bug, but look at the redirect handling
tests in ip_rt_redirect().  It's a very similar key comparison as
the PMTU code, just structured differently:

            if (rth->fl.fl4_dst != daddr ||
                rth->fl.fl4_src != skeys[i] ||
                rth->fl.fl4_tos != tos ||
                rth->fl.oif != ikeys[k] ||
                rth->fl.iif != 0) {
                    rthp = &rth->u.rt_next;
                    continue;
            }

            if (rth->rt_dst != daddr ||
                rth->rt_src != saddr ||
                rth->u.dst.error ||
                rth->rt_gateway != old_gw ||
                rth->u.dst.dev != dev)
                    break;

See?  He's not comparing rt->rt_src against skeys[] and therefore '0'
here either.

I think the tests might be like this for a reason.

I could see Alexey constructing this test wrong in one instance, but
in two instances where the tests were structured totally different in
each case is hard to believe.

Let me think about this some more, maybe you're right and the
error exists in both of these places.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-01 22:21       ` David S. Miller
@ 2003-12-01 23:22         ` David S. Miller
  2003-12-02 10:10           ` Herbert Xu
  0 siblings, 1 reply; 23+ messages in thread
From: David S. Miller @ 2003-12-01 23:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: herbert, netdev

On Mon, 1 Dec 2003 14:21:31 -0800
"David S. Miller" <davem@redhat.com> wrote:

> Let me think about this some more, maybe you're right and the
> error exists in both of these places.

Ok, I did my thinking :)

rt->rt_src is special.  It is the source address we have selected
to use with this route.  All output packets using this route must
use rt->rt_src as iph->saddr.

So, in effect, when we say "if (rt->rt_src == iph->saddr)" we
are asking the question "did we make this packet?"  I think this
is why Alexey coded the test in this way.

You are speaking of a case of zero source addresses.  When would
we output such an iph->saddr, by way of a route?  Right now this
is the only part I'm not seeing.

I want to be careful in changing this code, as loosening the key check
opens the possibility of new kinds of PMTU lowering attacks.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-01 20:47 ` Herbert Xu
  2003-12-01 21:51   ` David S. Miller
@ 2003-12-01 23:30   ` Julian Anastasov
  2003-12-01 23:50     ` David S. Miller
  1 sibling, 1 reply; 23+ messages in thread
From: Julian Anastasov @ 2003-12-01 23:30 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David S. Miller, netdev


	Hello,

On Tue, 2 Dec 2003, Herbert Xu wrote:

> On Tue, Dec 02, 2003 at 07:16:51AM +1100, herbert wrote:
> >
> > I found out that PMTU only works on those routing cache entries where
> > rt_src != 0.  This patch should make it work for all matching entries.
>
> That patch removed one line too many.  This one should be better.

	IMO, the rt_src check in ip_rt_frag_needed is ok.

	I would suspect all rth->fl.fl4_tos checks too.
It seems we need toskeys[2] and a second for loop if tos!=0.
What about rewriting them to (rth->fl.fl4_tos == toskeys[j]).

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-01 23:30   ` Julian Anastasov
@ 2003-12-01 23:50     ` David S. Miller
  2003-12-02  0:04       ` Julian Anastasov
                         ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: David S. Miller @ 2003-12-01 23:50 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: herbert, netdev

On Tue, 2 Dec 2003 01:30:17 +0200 (EET)
Julian Anastasov <ja@ssi.bg> wrote:

> 	I would suspect all rth->fl.fl4_tos checks too.
> It seems we need toskeys[2] and a second for loop if tos!=0.
> What about rewriting them to (rth->fl.fl4_tos == toskeys[j]).

I disagree, and this is related to my most recent email
in this thread.

This packet we are reacting to for PMTU purposes could only
have come from us if the TOS matches precisely.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-01 23:50     ` David S. Miller
@ 2003-12-02  0:04       ` Julian Anastasov
  2003-12-02  0:07       ` Julian Anastasov
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: Julian Anastasov @ 2003-12-02  0:04 UTC (permalink / raw)
  To: David S. Miller; +Cc: herbert, netdev


	Hello,

On Mon, 1 Dec 2003, David S. Miller wrote:

> On Tue, 2 Dec 2003 01:30:17 +0200 (EET)
> Julian Anastasov <ja@ssi.bg> wrote:
>
> > 	I would suspect all rth->fl.fl4_tos checks too.
> > It seems we need toskeys[2] and a second for loop if tos!=0.
> > What about rewriting them to (rth->fl.fl4_tos == toskeys[j]).
>
> I disagree, and this is related to my most recent email
> in this thread.

	No, only input routes match strictly tos, ip_rt_redirect()
is such example that matches input routes.

> This packet we are reacting to for PMTU purposes could only
> have come from us if the TOS matches precisely.

	In this case we can react to packet routed with tos=0.
We match output routes only. I do not see another place that needs such 
fix.

	Example patch (not tested):

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.1350  -> 1.1351 
#	    net/ipv4/route.c	1.73    -> 1.74   
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/12/02	ja@ssi.bg	1.1351
# [IPV4]: ip_rt_frag_needed: fl4_tos accepts wildcard value for output routes
# --------------------------------------------
#
diff -Nru a/net/ipv4/route.c b/net/ipv4/route.c
--- a/net/ipv4/route.c	Tue Dec  2 02:00:54 2003
+++ b/net/ipv4/route.c	Tue Dec  2 02:00:54 2003
@@ -1239,19 +1239,21 @@
 
 unsigned short ip_rt_frag_needed(struct iphdr *iph, unsigned short new_mtu)
 {
-	int i;
+	int i, j;
 	unsigned short old_mtu = ntohs(iph->tot_len);
 	struct rtable *rth;
 	u32  skeys[2] = { iph->saddr, 0, };
 	u32  daddr = iph->daddr;
 	u8   tos = iph->tos & IPTOS_RT_MASK;
 	unsigned short est_mtu = 0;
+	u8 toskeys[2] = { tos, 0 };
 
 	if (ipv4_config.no_pmtu_disc)
 		return 0;
 
+	for (j = 0; j < (tos ? 2 : 1); j++)
 	for (i = 0; i < 2; i++) {
-		unsigned hash = rt_hash_code(daddr, skeys[i], tos);
+		unsigned hash = rt_hash_code(daddr, skeys[i], toskeys[j]);
 
 		rcu_read_lock();
 		for (rth = rt_hash_table[hash].chain; rth;
@@ -1261,7 +1263,7 @@
 			    rth->fl.fl4_src == skeys[i] &&
 			    rth->rt_dst  == daddr &&
 			    rth->rt_src  == iph->saddr &&
-			    rth->fl.fl4_tos == tos &&
+			    rth->fl.fl4_tos == toskeys[j] &&
 			    rth->fl.iif == 0 &&
 			    !(dst_metric_locked(&rth->u.dst, RTAX_MTU))) {
 				unsigned short mtu = new_mtu;

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-01 23:50     ` David S. Miller
  2003-12-02  0:04       ` Julian Anastasov
@ 2003-12-02  0:07       ` Julian Anastasov
  2003-12-02  0:08         ` David S. Miller
  2003-12-02  1:53       ` Julian Anastasov
  2003-12-02 23:40       ` Julian Anastasov
  3 siblings, 1 reply; 23+ messages in thread
From: Julian Anastasov @ 2003-12-02  0:07 UTC (permalink / raw)
  To: David S. Miller; +Cc: herbert, netdev


	Hello,

On Mon, 1 Dec 2003, David S. Miller wrote:

> On Tue, 2 Dec 2003 01:30:17 +0200 (EET)
> Julian Anastasov <ja@ssi.bg> wrote:
>
> > 	I would suspect all rth->fl.fl4_tos checks too.
> > It seems we need toskeys[2] and a second for loop if tos!=0.
> > What about rewriting them to (rth->fl.fl4_tos == toskeys[j]).
>
> I disagree, and this is related to my most recent email
> in this thread.
>
> This packet we are reacting to for PMTU purposes could only
> have come from us if the TOS matches precisely.

	Ops, ip_rt_redirect matches output route to, hm, lets think
again about it...

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-02  0:07       ` Julian Anastasov
@ 2003-12-02  0:08         ` David S. Miller
  0 siblings, 0 replies; 23+ messages in thread
From: David S. Miller @ 2003-12-02  0:08 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: herbert, netdev

On Tue, 2 Dec 2003 02:07:00 +0200 (EET)
Julian Anastasov <ja@ssi.bg> wrote:

> 
> On Mon, 1 Dec 2003, David S. Miller wrote:
> 
> > This packet we are reacting to for PMTU purposes could only
> > have come from us if the TOS matches precisely.
> 
> 	Ops, ip_rt_redirect matches output route to, hm, lets think
> again about it...

Right.

I've rewritten emails in this thread probably 3 or 4 times
each before actually sending them out, so don't feel bad
since my mistakes have been merely hidden :)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-01 23:50     ` David S. Miller
  2003-12-02  0:04       ` Julian Anastasov
  2003-12-02  0:07       ` Julian Anastasov
@ 2003-12-02  1:53       ` Julian Anastasov
  2003-12-02 23:40       ` Julian Anastasov
  3 siblings, 0 replies; 23+ messages in thread
From: Julian Anastasov @ 2003-12-02  1:53 UTC (permalink / raw)
  To: David S. Miller; +Cc: herbert, netdev


	Hello,

On Mon, 1 Dec 2003, David S. Miller wrote:

> I disagree, and this is related to my most recent email
> in this thread.
>
> This packet we are reacting to for PMTU purposes could only
> have come from us if the TOS matches precisely.

	Here is what I have for today. I assume all ip_route_output
callers provide valid tos (not a wildcard). As result, only RTO_ONLINK and 
oif have wildcard value. I'm not sure if ip_rt_frag_needed needs an iif 
argument, may be yes?

	Also, it seems ip_rt_redirect needs the 'tos, tos | RTO_ONLINK'
array too as in ip_rt_frag_needed. Not included yet.

	Another problem: it seems __ip_route_output_key does not
hash with valid tos key bits, fix included below:

--- net/ipv4/route.c.orig	Tue Dec  2 03:25:59 2003
+++ net/ipv4/route.c	Tue Dec  2 03:37:27 2003
@@ -1239,19 +1239,25 @@
 
 unsigned short ip_rt_frag_needed(struct iphdr *iph, unsigned short new_mtu)
 {
-	int i;
+	int i, j, k;
 	unsigned short old_mtu = ntohs(iph->tot_len);
 	struct rtable *rth;
 	u32  skeys[2] = { iph->saddr, 0, };
 	u32  daddr = iph->daddr;
 	u8   tos = iph->tos & IPTOS_RT_MASK;
 	unsigned short est_mtu = 0;
+	u8 toskeys[2] = { tos, tos | RTO_ONLINK };
+	int iif = 0; // Can be argument
+	int  ikeys[2] = { iif, 0 };
 
 	if (ipv4_config.no_pmtu_disc)
 		return 0;
 
+	for (k = 0; k < (iif ? 2 : 1); k++)
+	for (j = 0; j < 2; j++)
 	for (i = 0; i < 2; i++) {
-		unsigned hash = rt_hash_code(daddr, skeys[i], tos);
+		unsigned hash = rt_hash_code(daddr, skeys[i] ^ (ikeys[k] << 5),
+					     toskeys[j]);
 
 		rcu_read_lock();
 		for (rth = rt_hash_table[hash].chain; rth;
@@ -1261,7 +1267,8 @@
 			    rth->fl.fl4_src == skeys[i] &&
 			    rth->rt_dst  == daddr &&
 			    rth->rt_src  == iph->saddr &&
-			    rth->fl.fl4_tos == tos &&
+			    rth->fl.fl4_tos == toskeys[j] &&
+			    rth->fl.oif == ikeys[k] &&
 			    rth->fl.iif == 0 &&
 			    !(dst_metric_locked(&rth->u.dst, RTAX_MTU))) {
 				unsigned short mtu = new_mtu;
@@ -2214,7 +2221,8 @@
 	unsigned hash;
 	struct rtable *rth;
 
-	hash = rt_hash_code(flp->fl4_dst, flp->fl4_src ^ (flp->oif << 5), flp->fl4_tos);
+	hash = rt_hash_code(flp->fl4_dst, flp->fl4_src ^ (flp->oif << 5),
+		flp->fl4_tos & (IPTOS_RT_MASK | RTO_ONLINK));
 
 	rcu_read_lock();
 	for (rth = rt_hash_table[hash].chain; rth; rth = rth->u.rt_next) {

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-01 23:22         ` David S. Miller
@ 2003-12-02 10:10           ` Herbert Xu
  2003-12-02 10:27             ` David S. Miller
  0 siblings, 1 reply; 23+ messages in thread
From: Herbert Xu @ 2003-12-02 10:10 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

On Mon, Dec 01, 2003 at 03:22:15PM -0800, David S. Miller wrote:
> 
> You are speaking of a case of zero source addresses.  When would
> we output such an iph->saddr, by way of a route?  Right now this
> is the only part I'm not seeing.

You're right.  My patch is totally bogus.

I misread the ip(8) output.  I thought that if src wasn't shown that
rt_src must be zero.  But in fact it means that rt_src == fl4_src.

My problem turns out to be that oif != 0 for the outgoing packets.
Since frag_needed only handle cache entries where oif == 0 it
never has a chance to work.

The application that generated these packets is the RPC code in glibc.

Cheers,
-- 
Debian GNU/Linux 3.0 is out! ( http://www.debian.org/ )
Email:  Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-02 10:10           ` Herbert Xu
@ 2003-12-02 10:27             ` David S. Miller
  2003-12-02 10:33               ` Herbert Xu
  0 siblings, 1 reply; 23+ messages in thread
From: David S. Miller @ 2003-12-02 10:27 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev

On Tue, 2 Dec 2003 21:10:25 +1100
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> My problem turns out to be that oif != 0 for the outgoing packets.
> Since frag_needed only handle cache entries where oif == 0 it
> never has a chance to work.
> 
> The application that generated these packets is the RPC code in glibc.

That behavior of glibc is incorrect, I know about it, and I explained
all this to Uli Drepper some time ago and he fixed it.

Current glibc should not be doing this.  If it still is, since Uli
understood my arguments, it probably just slipped under the rug.
Tell me this so I can take care of it.

What the glibc code was doing was mirroring the input packet parameters
(saddr/daddr/if/etc.) into what it used for the output packet sends.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-02 10:27             ` David S. Miller
@ 2003-12-02 10:33               ` Herbert Xu
  0 siblings, 0 replies; 23+ messages in thread
From: Herbert Xu @ 2003-12-02 10:33 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

On Tue, Dec 02, 2003 at 02:27:33AM -0800, David S. Miller wrote:
> 
> That behavior of glibc is incorrect, I know about it, and I explained
> all this to Uli Drepper some time ago and he fixed it.

Cool.  I can confirm that this is definitely fixed with a more recent
glibc.
-- 
Debian GNU/Linux 3.0 is out! ( http://www.debian.org/ )
Email:  Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-01 23:50     ` David S. Miller
                         ` (2 preceding siblings ...)
  2003-12-02  1:53       ` Julian Anastasov
@ 2003-12-02 23:40       ` Julian Anastasov
  2003-12-03  0:07         ` David S. Miller
  2003-12-03  0:39         ` David S. Miller
  3 siblings, 2 replies; 23+ messages in thread
From: Julian Anastasov @ 2003-12-02 23:40 UTC (permalink / raw)
  To: David S. Miller; +Cc: herbert, netdev


	Hello,

	I'm appending new version that handles all missing cases for 
hashed values on icmp error:

- ip_rt_frag_needed now provides interface index
- ip_rt_redirect checks for tos|RTO_ONLINK too
- ip_rt_frag_needed checks also for tos|RTO_ONLINK and oif!=0
- __ip_route_output_key now ignores illegal bits (bit 1) from tos

	Please review and edit if needed.

diff -Nru a/include/net/route.h b/include/net/route.h
--- a/include/net/route.h	Tue Dec  2 23:37:17 2003
+++ b/include/net/route.h	Tue Dec  2 23:37:17 2003
@@ -122,7 +122,7 @@
 extern int		ip_route_output_key(struct rtable **, struct flowi *flp);
 extern int		ip_route_output_flow(struct rtable **rp, struct flowi *flp, struct sock *sk, int flags);
 extern int		ip_route_input(struct sk_buff*, u32 dst, u32 src, u8 tos, struct net_device *devin);
-extern unsigned short	ip_rt_frag_needed(struct iphdr *iph, unsigned short new_mtu);
+extern unsigned short	ip_rt_frag_needed(int iif, struct iphdr *iph, unsigned short new_mtu);
 extern void		ip_rt_send_redirect(struct sk_buff *skb);
 
 extern unsigned		inet_addr_type(u32 addr);
diff -Nru a/net/ipv4/icmp.c b/net/ipv4/icmp.c
--- a/net/ipv4/icmp.c	Tue Dec  2 23:37:17 2003
+++ b/net/ipv4/icmp.c	Tue Dec  2 23:37:17 2003
@@ -626,7 +626,7 @@
 							 "and DF set.\n",
 					       NIPQUAD(iph->daddr));
 			} else {
-				info = ip_rt_frag_needed(iph,
+				info = ip_rt_frag_needed(skb->dev->ifindex, iph,
 						     ntohs(icmph->un.frag.mtu));
 				if (!info)
 					goto out;
diff -Nru a/net/ipv4/route.c b/net/ipv4/route.c
--- a/net/ipv4/route.c	Tue Dec  2 23:37:17 2003
+++ b/net/ipv4/route.c	Tue Dec  2 23:37:17 2003
@@ -967,11 +967,12 @@
 void ip_rt_redirect(u32 old_gw, u32 daddr, u32 new_gw,
 		    u32 saddr, u8 tos, struct net_device *dev)
 {
-	int i, k;
+	int i, j, k;
 	struct in_device *in_dev = in_dev_get(dev);
 	struct rtable *rth, **rthp;
 	u32  skeys[2] = { saddr, 0 };
 	int  ikeys[2] = { dev->ifindex, 0 };
+	u8 toskeys[2];
 
 	tos &= IPTOS_RT_MASK;
 
@@ -992,11 +993,15 @@
 			goto reject_redirect;
 	}
 
+	toskeys[0] = tos;
+	toskeys[1] = tos | RTO_ONLINK;
+	if (saddr && daddr)
+	for (j = 0; j < 2; j++)
 	for (i = 0; i < 2; i++) {
 		for (k = 0; k < 2; k++) {
 			unsigned hash = rt_hash_code(daddr,
 						     skeys[i] ^ (ikeys[k] << 5),
-						     tos);
+						     toskeys[j]);
 
 			rthp=&rt_hash_table[hash].chain;
 
@@ -1007,7 +1012,7 @@
 				smp_read_barrier_depends();
 				if (rth->fl.fl4_dst != daddr ||
 				    rth->fl.fl4_src != skeys[i] ||
-				    rth->fl.fl4_tos != tos ||
+				    rth->fl.fl4_tos != toskeys[j] ||
 				    rth->fl.oif != ikeys[k] ||
 				    rth->fl.iif != 0) {
 					rthp = &rth->u.rt_next;
@@ -1237,21 +1242,26 @@
 	return 68;
 }
 
-unsigned short ip_rt_frag_needed(struct iphdr *iph, unsigned short new_mtu)
+unsigned short ip_rt_frag_needed(int iif, struct iphdr *iph, unsigned short new_mtu)
 {
-	int i;
+	int i, j, k;
 	unsigned short old_mtu = ntohs(iph->tot_len);
 	struct rtable *rth;
 	u32  skeys[2] = { iph->saddr, 0, };
 	u32  daddr = iph->daddr;
 	u8   tos = iph->tos & IPTOS_RT_MASK;
 	unsigned short est_mtu = 0;
+	u8 toskeys[2] = { tos, tos | RTO_ONLINK };
+	int  ikeys[2] = { iif, 0 };
 
 	if (ipv4_config.no_pmtu_disc)
 		return 0;
 
+	for (k = 0; k < (iif ? 2 : 1); k++)
+	for (j = 0; j < 2; j++)
 	for (i = 0; i < 2; i++) {
-		unsigned hash = rt_hash_code(daddr, skeys[i], tos);
+		unsigned hash = rt_hash_code(daddr, skeys[i] ^ (ikeys[k] << 5),
+					     toskeys[j]);
 
 		rcu_read_lock();
 		for (rth = rt_hash_table[hash].chain; rth;
@@ -1261,7 +1271,8 @@
 			    rth->fl.fl4_src == skeys[i] &&
 			    rth->rt_dst  == daddr &&
 			    rth->rt_src  == iph->saddr &&
-			    rth->fl.fl4_tos == tos &&
+			    rth->fl.fl4_tos == toskeys[j] &&
+			    rth->fl.oif == ikeys[k] &&
 			    rth->fl.iif == 0 &&
 			    !(dst_metric_locked(&rth->u.dst, RTAX_MTU))) {
 				unsigned short mtu = new_mtu;
@@ -2213,8 +2224,9 @@
 {
 	unsigned hash;
 	struct rtable *rth;
+	u8 tos = flp->fl4_tos & (IPTOS_RT_MASK | RTO_ONLINK);
 
-	hash = rt_hash_code(flp->fl4_dst, flp->fl4_src ^ (flp->oif << 5), flp->fl4_tos);
+	hash = rt_hash_code(flp->fl4_dst, flp->fl4_src ^ (flp->oif << 5), tos);
 
 	rcu_read_lock();
 	for (rth = rt_hash_table[hash].chain; rth; rth = rth->u.rt_next) {
@@ -2226,8 +2238,7 @@
 #ifdef CONFIG_IP_ROUTE_FWMARK
 		    rth->fl.fl4_fwmark == flp->fl4_fwmark &&
 #endif
-		    !((rth->fl.fl4_tos ^ flp->fl4_tos) &
-			    (IPTOS_RT_MASK | RTO_ONLINK))) {
+		    rth->fl.fl4_tos == tos) {
 			rth->u.dst.lastuse = jiffies;
 			dst_hold(&rth->u.dst);
 			rth->u.dst.__use++;

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-02 23:40       ` Julian Anastasov
@ 2003-12-03  0:07         ` David S. Miller
  2003-12-03  0:39         ` David S. Miller
  1 sibling, 0 replies; 23+ messages in thread
From: David S. Miller @ 2003-12-03  0:07 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: herbert, netdev

On Wed, 3 Dec 2003 01:40:06 +0200 (EET)
Julian Anastasov <ja@ssi.bg> wrote:

> 	Please review and edit if needed.

Thanks Julian, I will review your patch today.

But I want to mention something about this.  TOS checks
I think are pretty unreliable.

TOS has meaning within a network realm, right?  This means
that if we make a packet with TOS X, once it leaves our realm
the TOS value may be arbitrarily changed so that the TOS in
the packet has meaning within that new realm.

If a PMTU or other routing ICMP gets sent back to us after TOS changes
have been made, there is absolutely no way to reliably match the
routes properly for update.

Therefore TOS based routes cannot handle ICMP messages reliably
in the global internet.  In fact, it is possible and even likely
to apply an ICMP for the wrong TOS variant of a given route when
this TOS rewriting occurs.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-02 23:40       ` Julian Anastasov
  2003-12-03  0:07         ` David S. Miller
@ 2003-12-03  0:39         ` David S. Miller
  2003-12-03  1:43           ` Julian Anastasov
  2003-12-03 22:03           ` Julian Anastasov
  1 sibling, 2 replies; 23+ messages in thread
From: David S. Miller @ 2003-12-03  0:39 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: herbert, netdev

On Wed, 3 Dec 2003 01:40:06 +0200 (EET)
Julian Anastasov <ja@ssi.bg> wrote:

> - ip_rt_frag_needed now provides interface index

I disagree with this.

There is no connection between the route (and thus output
device) we used to send a packet and the interface on which
an ICMP for that packet comes back upon.  Think assymetric
routes.

You changes mean that for routes with specific output interfaces,
we will ignore ICMPs for those routes that arrive on other interfaces
due to assymetric routing.

Why don't you create a seperate patch that just has the TOS masking
changes?  That's much less controversial and thus something I'm likely
to apply.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-03  0:39         ` David S. Miller
@ 2003-12-03  1:43           ` Julian Anastasov
  2003-12-03 22:03           ` Julian Anastasov
  1 sibling, 0 replies; 23+ messages in thread
From: Julian Anastasov @ 2003-12-03  1:43 UTC (permalink / raw)
  To: David S. Miller; +Cc: herbert, netdev


	Hello,

On Tue, 2 Dec 2003, David S. Miller wrote:

> > - ip_rt_frag_needed now provides interface index
>
> I disagree with this.
>
> There is no connection between the route (and thus output
> device) we used to send a packet and the interface on which
> an ICMP for that packet comes back upon.  Think assymetric
> routes.

	Agreed. This is a very bad case for PMTUD. It can be
even a multipath route with different PMTUs.

> You changes mean that for routes with specific output interfaces,
> we will ignore ICMPs for those routes that arrive on other interfaces
> due to assymetric routing.

	They are already ignored because the oif is a hash key.
With the current code I do not think we can find other oif
values in the current row when the oif hash key is 0. With the
changes, we are trying to walk the rows for oif=IIF and oif=0. We
can not learn how many paths and oifs we have before the smallest
pmtu that generates ICMPs. We are even not sure when the ICMP
comes if the originating packet was routed with oif 0 or not 0.
So, may be it is better to set the smallest pmtu for all these
paths, for oif 0 as before and now for oif=IIF.

        About CONFIG_IP_ROUTE_TOS, why it is not propagated
to the routing cache? Then for the fl4_tos value we will have
only two cases, onlink or not onlink. In essence, I see it
as defining IPTOS_RT_MASK to 0 when CONFIG_IP_ROUTE_TOS is not
defined. If this is working it solves so many problems: cache size,
PMTUD. Is it really so easy?

	CONFIG_IP_ROUTE_TOS disables tos match in rules and routes,
so why we have to waste memory for the route cache when all these
routes receive same path no matter the tos value?

> Why don't you create a seperate patch that just has the TOS masking
> changes?  That's much less controversial and thus something I'm likely
> to apply.

	ok, may be after some days for rethinking :) let me
know for the final thoughts and I can create patch(es) after
reading some standards.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-03  0:39         ` David S. Miller
  2003-12-03  1:43           ` Julian Anastasov
@ 2003-12-03 22:03           ` Julian Anastasov
  2003-12-05 20:43             ` David S. Miller
  1 sibling, 1 reply; 23+ messages in thread
From: Julian Anastasov @ 2003-12-03 22:03 UTC (permalink / raw)
  To: David S. Miller; +Cc: herbert, netdev


	Hello,

On Tue, 2 Dec 2003, David S. Miller wrote:

> You changes mean that for routes with specific output interfaces,
> we will ignore ICMPs for those routes that arrive on other interfaces
> due to assymetric routing.
>
> Why don't you create a seperate patch that just has the TOS masking
> changes?  That's much less controversial and thus something I'm likely
> to apply.

	Here is a new version, all changes are:

- ip_rt_redirect works for entries oif=IIF and oif=0 as before

- ip_rt_redirect now supports RTO_ONLINK

- ip_rt_frag_needed now supports RTO_ONLINK and all oif!=0

- ifindex is not anymore a hash key. This is required for
ip_rt_frag_needed to deliver the event to all entries no
matter the oif key. I'm not sure if this is for good or for
bad for the hash table distribution. I hope jenkins hash is
ready for this as it is not a common case to have multiple
oifs per one saddr-daddr-tos.

- __ip_route_output_key now ignores illegal bits (bit 1) from tos

diff -Nru a/net/ipv4/route.c b/net/ipv4/route.c
--- a/net/ipv4/route.c	Wed Dec  3 23:37:00 2003
+++ b/net/ipv4/route.c	Wed Dec  3 23:37:00 2003
@@ -967,11 +967,11 @@
 void ip_rt_redirect(u32 old_gw, u32 daddr, u32 new_gw,
 		    u32 saddr, u8 tos, struct net_device *dev)
 {
-	int i, k;
+	int i, j;
 	struct in_device *in_dev = in_dev_get(dev);
 	struct rtable *rth, **rthp;
 	u32  skeys[2] = { saddr, 0 };
-	int  ikeys[2] = { dev->ifindex, 0 };
+	u8 toskeys[2];
 
 	tos &= IPTOS_RT_MASK;
 
@@ -992,11 +992,12 @@
 			goto reject_redirect;
 	}
 
+	toskeys[0] = tos;
+	toskeys[1] = tos | RTO_ONLINK;
+	if (saddr && daddr)
 	for (i = 0; i < 2; i++) {
-		for (k = 0; k < 2; k++) {
-			unsigned hash = rt_hash_code(daddr,
-						     skeys[i] ^ (ikeys[k] << 5),
-						     tos);
+		for (j = 0; j < 2; j++) {
+			unsigned hash = rt_hash_code(daddr, skeys[i], toskeys[j]);
 
 			rthp=&rt_hash_table[hash].chain;
 
@@ -1007,8 +1008,9 @@
 				smp_read_barrier_depends();
 				if (rth->fl.fl4_dst != daddr ||
 				    rth->fl.fl4_src != skeys[i] ||
-				    rth->fl.fl4_tos != tos ||
-				    rth->fl.oif != ikeys[k] ||
+				    rth->fl.fl4_tos != toskeys[j] ||
+				    (rth->fl.oif &&
+				     rth->fl.oif != dev->ifindex) ||
 				    rth->fl.iif != 0) {
 					rthp = &rth->u.rt_next;
 					continue;
@@ -1105,8 +1107,7 @@
 		} else if ((rt->rt_flags & RTCF_REDIRECTED) ||
 			   rt->u.dst.expires) {
 			unsigned hash = rt_hash_code(rt->fl.fl4_dst,
-						     rt->fl.fl4_src ^
-							(rt->fl.oif << 5),
+						     rt->fl.fl4_src,
 						     rt->fl.fl4_tos);
 #if RT_CACHE_DEBUG >= 1
 			printk(KERN_DEBUG "ip_rt_advice: redirect to "
@@ -1239,19 +1240,21 @@
 
 unsigned short ip_rt_frag_needed(struct iphdr *iph, unsigned short new_mtu)
 {
-	int i;
+	int i, j;
 	unsigned short old_mtu = ntohs(iph->tot_len);
 	struct rtable *rth;
 	u32  skeys[2] = { iph->saddr, 0, };
 	u32  daddr = iph->daddr;
 	u8   tos = iph->tos & IPTOS_RT_MASK;
 	unsigned short est_mtu = 0;
+	u8 toskeys[2] = { tos, tos | RTO_ONLINK };
 
 	if (ipv4_config.no_pmtu_disc)
 		return 0;
 
+	for (j = 0; j < 2; j++)
 	for (i = 0; i < 2; i++) {
-		unsigned hash = rt_hash_code(daddr, skeys[i], tos);
+		unsigned hash = rt_hash_code(daddr, skeys[i], toskeys[j]);
 
 		rcu_read_lock();
 		for (rth = rt_hash_table[hash].chain; rth;
@@ -1261,7 +1264,7 @@
 			    rth->fl.fl4_src == skeys[i] &&
 			    rth->rt_dst  == daddr &&
 			    rth->rt_src  == iph->saddr &&
-			    rth->fl.fl4_tos == tos &&
+			    rth->fl.fl4_tos == toskeys[j] &&
 			    rth->fl.iif == 0 &&
 			    !(dst_metric_locked(&rth->u.dst, RTAX_MTU))) {
 				unsigned short mtu = new_mtu;
@@ -1503,7 +1506,7 @@
 	RT_CACHE_STAT_INC(in_slow_mc);
 
 	in_dev_put(in_dev);
-	hash = rt_hash_code(daddr, saddr ^ (dev->ifindex << 5), tos);
+	hash = rt_hash_code(daddr, saddr, tos);
 	return rt_intern_hash(hash, rth, (struct rtable**) &skb->dst);
 
 e_nobufs:
@@ -1554,7 +1557,7 @@
 	if (!in_dev)
 		goto out;
 
-	hash = rt_hash_code(daddr, saddr ^ (fl.iif << 5), tos);
+	hash = rt_hash_code(daddr, saddr, tos);
 
 	/* Check for the most weird martians, which can be not detected
 	   by fib_lookup.
@@ -1847,7 +1850,7 @@
 	int iif = dev->ifindex;
 
 	tos &= IPTOS_RT_MASK;
-	hash = rt_hash_code(daddr, saddr ^ (iif << 5), tos);
+	hash = rt_hash_code(daddr, saddr, tos);
 
 	rcu_read_lock();
 	for (rth = rt_hash_table[hash].chain; rth; rth = rth->u.rt_next) {
@@ -2190,7 +2193,7 @@
 
 	rth->rt_flags = flags;
 
-	hash = rt_hash_code(oldflp->fl4_dst, oldflp->fl4_src ^ (oldflp->oif << 5), tos);
+	hash = rt_hash_code(oldflp->fl4_dst, oldflp->fl4_src, tos);
 	err = rt_intern_hash(hash, rth, rp);
 done:
 	if (free_res)
@@ -2213,8 +2216,9 @@
 {
 	unsigned hash;
 	struct rtable *rth;
+	u8 tos = flp->fl4_tos & (IPTOS_RT_MASK | RTO_ONLINK);
 
-	hash = rt_hash_code(flp->fl4_dst, flp->fl4_src ^ (flp->oif << 5), flp->fl4_tos);
+	hash = rt_hash_code(flp->fl4_dst, flp->fl4_src, tos);
 
 	rcu_read_lock();
 	for (rth = rt_hash_table[hash].chain; rth; rth = rth->u.rt_next) {
@@ -2226,8 +2230,7 @@
 #ifdef CONFIG_IP_ROUTE_FWMARK
 		    rth->fl.fl4_fwmark == flp->fl4_fwmark &&
 #endif
-		    !((rth->fl.fl4_tos ^ flp->fl4_tos) &
-			    (IPTOS_RT_MASK | RTO_ONLINK))) {
+		    rth->fl.fl4_tos == tos) {
 			rth->u.dst.lastuse = jiffies;
 			dst_hold(&rth->u.dst);
 			rth->u.dst.__use++;

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-03 22:03           ` Julian Anastasov
@ 2003-12-05 20:43             ` David S. Miller
  2003-12-06  7:52               ` Julian Anastasov
  0 siblings, 1 reply; 23+ messages in thread
From: David S. Miller @ 2003-12-05 20:43 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: herbert, netdev

On Thu, 4 Dec 2003 00:03:20 +0200 (EET)
Julian Anastasov <ja@ssi.bg> wrote:

> - ip_rt_redirect now supports RTO_ONLINK
> 
> - ip_rt_frag_needed now supports RTO_ONLINK and all oif!=0

Let's discuss what RTO_ONLINK means :)

It means, "Do not route this packet".  It goes to the local subnet
and that's the end of the story.  Therefore there are no PMTU nor
redirects to handle when this TOS bit is set.

RTO_ONLINK is set by these two cases:

1) ARP 
2) RAW/UDP sockets when MSG_DONTROUTE is specified by the user.

So this patch still needs some work.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-05 20:43             ` David S. Miller
@ 2003-12-06  7:52               ` Julian Anastasov
  2003-12-10 23:15                 ` David S. Miller
  0 siblings, 1 reply; 23+ messages in thread
From: Julian Anastasov @ 2003-12-06  7:52 UTC (permalink / raw)
  To: David S. Miller; +Cc: herbert, netdev


	Hello,

On Fri, 5 Dec 2003, David S. Miller wrote:

> Let's discuss what RTO_ONLINK means :)
>
> It means, "Do not route this packet".  It goes to the local subnet
> and that's the end of the story.  Therefore there are no PMTU nor
> redirects to handle when this TOS bit is set.

	You are right about the case with redirects but for PMTUD
I'm still not sure. Here is one example: host A (we) thinks
that host B is onlink (rt_dst == rt_gateway). But host B runs
DNAT, tunneling or cluster software and forwards the packet
through other hosts which generate frag_needed. Such example
is IPVS TUN mode, I now see that it is not working because IPVS
on host B does not handle correctly the ICMP error and it can
not reach host A - a new difficult thing to fix.

> RTO_ONLINK is set by these two cases:
>
> 1) ARP
> 2) RAW/UDP sockets when MSG_DONTROUTE is specified by the user.

	UDP+RTO_ONLINK+PMTUD is still valid if we want to support
the above example setup. But I'm not sure someone will use
such combination of parameters. Should I remove the RTO_ONLINK
case from PMTUD?

	I have some questions about the redirects:

	Assume we have direct (rt_dst==rt_gateway) route with
SCOPE_HOST and shared_media is ON (if shared media is OFF
ip_fib_check_default already avoids SCOPE_HOST routes). I do
not see whether the standards cover such case, our target sends redirect
message but we are sure we hit the target IP directly. Is it supposed we
to change rt_gateway if rt_gateway==rt_dst ? I now included
such check in ip_rt_redirect but may be have to remove it.
IOW, the question is whether the ICMP redirects modify only
routes via gateway when shared_media is ON?

> So this patch still needs some work.

	OK, here is a new version, may be before the final one.
Changes from previous version:

- removed the RTO_ONLINK case from ip_rt_redirect

- removed the useless 'saddr && daddr' check which was added in
previous changes

- added temporarily 'rth->rt_dst == rth->rt_gateway' in
ip_rt_redirect. Is it needed?

diff -Nru a/net/ipv4/route.c b/net/ipv4/route.c
--- a/net/ipv4/route.c	Sat Dec  6 09:05:58 2003
+++ b/net/ipv4/route.c	Sat Dec  6 09:05:58 2003
@@ -967,11 +967,10 @@
 void ip_rt_redirect(u32 old_gw, u32 daddr, u32 new_gw,
 		    u32 saddr, u8 tos, struct net_device *dev)
 {
-	int i, k;
+	int i;
 	struct in_device *in_dev = in_dev_get(dev);
 	struct rtable *rth, **rthp;
 	u32  skeys[2] = { saddr, 0 };
-	int  ikeys[2] = { dev->ifindex, 0 };
 
 	tos &= IPTOS_RT_MASK;
 
@@ -993,10 +992,7 @@
 	}
 
 	for (i = 0; i < 2; i++) {
-		for (k = 0; k < 2; k++) {
-			unsigned hash = rt_hash_code(daddr,
-						     skeys[i] ^ (ikeys[k] << 5),
-						     tos);
+			unsigned hash = rt_hash_code(daddr, skeys[i], tos);
 
 			rthp=&rt_hash_table[hash].chain;
 
@@ -1008,7 +1004,9 @@
 				if (rth->fl.fl4_dst != daddr ||
 				    rth->fl.fl4_src != skeys[i] ||
 				    rth->fl.fl4_tos != tos ||
-				    rth->fl.oif != ikeys[k] ||
+				    (rth->fl.oif &&
+				     rth->fl.oif != dev->ifindex) ||
+				    rth->rt_dst == rth->rt_gateway ||
 				    rth->fl.iif != 0) {
 					rthp = &rth->u.rt_next;
 					continue;
@@ -1075,7 +1073,6 @@
 			rcu_read_unlock();
 		do_next:
 			;
-		}
 	}
 	in_dev_put(in_dev);
 	return;
@@ -1105,8 +1102,7 @@
 		} else if ((rt->rt_flags & RTCF_REDIRECTED) ||
 			   rt->u.dst.expires) {
 			unsigned hash = rt_hash_code(rt->fl.fl4_dst,
-						     rt->fl.fl4_src ^
-							(rt->fl.oif << 5),
+						     rt->fl.fl4_src,
 						     rt->fl.fl4_tos);
 #if RT_CACHE_DEBUG >= 1
 			printk(KERN_DEBUG "ip_rt_advice: redirect to "
@@ -1239,19 +1235,21 @@
 
 unsigned short ip_rt_frag_needed(struct iphdr *iph, unsigned short new_mtu)
 {
-	int i;
+	int i, j;
 	unsigned short old_mtu = ntohs(iph->tot_len);
 	struct rtable *rth;
 	u32  skeys[2] = { iph->saddr, 0, };
 	u32  daddr = iph->daddr;
 	u8   tos = iph->tos & IPTOS_RT_MASK;
 	unsigned short est_mtu = 0;
+	u8 toskeys[2] = { tos, tos | RTO_ONLINK };
 
 	if (ipv4_config.no_pmtu_disc)
 		return 0;
 
+	for (j = 0; j < 2; j++)
 	for (i = 0; i < 2; i++) {
-		unsigned hash = rt_hash_code(daddr, skeys[i], tos);
+		unsigned hash = rt_hash_code(daddr, skeys[i], toskeys[j]);
 
 		rcu_read_lock();
 		for (rth = rt_hash_table[hash].chain; rth;
@@ -1261,7 +1259,7 @@
 			    rth->fl.fl4_src == skeys[i] &&
 			    rth->rt_dst  == daddr &&
 			    rth->rt_src  == iph->saddr &&
-			    rth->fl.fl4_tos == tos &&
+			    rth->fl.fl4_tos == toskeys[j] &&
 			    rth->fl.iif == 0 &&
 			    !(dst_metric_locked(&rth->u.dst, RTAX_MTU))) {
 				unsigned short mtu = new_mtu;
@@ -1503,7 +1501,7 @@
 	RT_CACHE_STAT_INC(in_slow_mc);
 
 	in_dev_put(in_dev);
-	hash = rt_hash_code(daddr, saddr ^ (dev->ifindex << 5), tos);
+	hash = rt_hash_code(daddr, saddr, tos);
 	return rt_intern_hash(hash, rth, (struct rtable**) &skb->dst);
 
 e_nobufs:
@@ -1554,7 +1552,7 @@
 	if (!in_dev)
 		goto out;
 
-	hash = rt_hash_code(daddr, saddr ^ (fl.iif << 5), tos);
+	hash = rt_hash_code(daddr, saddr, tos);
 
 	/* Check for the most weird martians, which can be not detected
 	   by fib_lookup.
@@ -1847,7 +1845,7 @@
 	int iif = dev->ifindex;
 
 	tos &= IPTOS_RT_MASK;
-	hash = rt_hash_code(daddr, saddr ^ (iif << 5), tos);
+	hash = rt_hash_code(daddr, saddr, tos);
 
 	rcu_read_lock();
 	for (rth = rt_hash_table[hash].chain; rth; rth = rth->u.rt_next) {
@@ -2190,7 +2188,7 @@
 
 	rth->rt_flags = flags;
 
-	hash = rt_hash_code(oldflp->fl4_dst, oldflp->fl4_src ^ (oldflp->oif << 5), tos);
+	hash = rt_hash_code(oldflp->fl4_dst, oldflp->fl4_src, tos);
 	err = rt_intern_hash(hash, rth, rp);
 done:
 	if (free_res)
@@ -2213,8 +2211,9 @@
 {
 	unsigned hash;
 	struct rtable *rth;
+	u8 tos = flp->fl4_tos & (IPTOS_RT_MASK | RTO_ONLINK);
 
-	hash = rt_hash_code(flp->fl4_dst, flp->fl4_src ^ (flp->oif << 5), flp->fl4_tos);
+	hash = rt_hash_code(flp->fl4_dst, flp->fl4_src, tos);
 
 	rcu_read_lock();
 	for (rth = rt_hash_table[hash].chain; rth; rth = rth->u.rt_next) {
@@ -2226,8 +2225,7 @@
 #ifdef CONFIG_IP_ROUTE_FWMARK
 		    rth->fl.fl4_fwmark == flp->fl4_fwmark &&
 #endif
-		    !((rth->fl.fl4_tos ^ flp->fl4_tos) &
-			    (IPTOS_RT_MASK | RTO_ONLINK))) {
+		    rth->fl.fl4_tos == tos) {
 			rth->u.dst.lastuse = jiffies;
 			dst_hold(&rth->u.dst);
 			rth->u.dst.__use++;

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [ROUTE] PMTU only works on half the time
  2003-12-06  7:52               ` Julian Anastasov
@ 2003-12-10 23:15                 ` David S. Miller
  0 siblings, 0 replies; 23+ messages in thread
From: David S. Miller @ 2003-12-10 23:15 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: herbert, netdev

On Sat, 6 Dec 2003 09:52:01 +0200 (EET)
Julian Anastasov <ja@ssi.bg> wrote:

> 	You are right about the case with redirects but for PMTUD
> I'm still not sure. Here is one example: host A (we) thinks
> that host B is onlink (rt_dst == rt_gateway). But host B runs
> DNAT, tunneling or cluster software and forwards the packet
> through other hosts which generate frag_needed. Such example
> is IPVS TUN mode, I now see that it is not working because IPVS
> on host B does not handle correctly the ICMP error and it can
> not reach host A - a new difficult thing to fix.

Entites that proport to provide IP services onlink, yet really
do not, need to preserve all behaviors such that they appear
to be onlink as far as other hosts can see.

This means no PMTU for onlink destinations.

> 	UDP+RTO_ONLINK+PMTUD is still valid if we want to support
> the above example setup. But I'm not sure someone will use
> such combination of parameters. Should I remove the RTO_ONLINK
> case from PMTUD?

That is what I think is best.

> 	Assume we have direct (rt_dst==rt_gateway) route with
> SCOPE_HOST and shared_media is ON (if shared media is OFF
> ip_fib_check_default already avoids SCOPE_HOST routes). I do
> not see whether the standards cover such case, our target sends redirect
> message but we are sure we hit the target IP directly. Is it supposed we
> to change rt_gateway if rt_gateway==rt_dst ? I now included
> such check in ip_rt_redirect but may be have to remove it.
> IOW, the question is whether the ICMP redirects modify only
> routes via gateway when shared_media is ON?

Interesting, I've never explored this area.  I honestly don't know,
and I will study this issue some more.

> 	OK, here is a new version, may be before the final one.
> Changes from previous version:
> 
> - removed the RTO_ONLINK case from ip_rt_redirect

OK.

> - removed the useless 'saddr && daddr' check which was added in
> previous changes

Ok.

> - added temporarily 'rth->rt_dst == rth->rt_gateway' in
> ip_rt_redirect. Is it needed?

I'll get back to you on this :)

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2003-12-10 23:15 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-12-01 20:16 [ROUTE] PMTU only works on half the time Herbert Xu
2003-12-01 20:47 ` Herbert Xu
2003-12-01 21:51   ` David S. Miller
2003-12-01 22:05     ` Herbert Xu
2003-12-01 22:21       ` David S. Miller
2003-12-01 23:22         ` David S. Miller
2003-12-02 10:10           ` Herbert Xu
2003-12-02 10:27             ` David S. Miller
2003-12-02 10:33               ` Herbert Xu
2003-12-01 23:30   ` Julian Anastasov
2003-12-01 23:50     ` David S. Miller
2003-12-02  0:04       ` Julian Anastasov
2003-12-02  0:07       ` Julian Anastasov
2003-12-02  0:08         ` David S. Miller
2003-12-02  1:53       ` Julian Anastasov
2003-12-02 23:40       ` Julian Anastasov
2003-12-03  0:07         ` David S. Miller
2003-12-03  0:39         ` David S. Miller
2003-12-03  1:43           ` Julian Anastasov
2003-12-03 22:03           ` Julian Anastasov
2003-12-05 20:43             ` David S. Miller
2003-12-06  7:52               ` Julian Anastasov
2003-12-10 23:15                 ` David S. Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).