[PATCH v2 -next] net: fib: use fib result when zero-length prefix aliases exist

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 -next] net: fib: use fib result when zero-length prefix aliases exist
@ 2015-07-17 17:37 Florian Westphal
  2015-07-17 22:47 ` Julian Anastasov
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Westphal @ 2015-07-17 17:37 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal, Alexander Duyck

default route selection is not deterministic when TOS keys are used:

ip route del default

ip route add tos 0x00 via 10.2.100.100
ip route add tos 0x04 via 10.2.100.101
ip route add tos 0x08 via 10.2.100.102
ip route add tos 0x0C via 10.2.100.103
ip route add tos 0x10 via 10.2.100.104

[ i.e. 5 routes with prefix length 0, differentiated via TOS key ]

ip route get 10.3.1.1 tos 0x4
-> 10.2.100.101
ip route get 10.3.1.1 tos 0x8
-> 10.2.100.102
ip route get tos 0x0C
-> 10.2.100.103

But for 0x10, we'll get round-robin behavour among all the aliases.
Repeated invocations return .100, 101, 102, etc. in turn.

This behaviour is not new  -- fib_select_default can be traced back to
fn_hash_select_default in CVS history.

Routing cache made 'round-robin' behaviour less visible.

This changes fib_select_default to not change the FIB chosen result EXCEPT
if this nexthop appears to be unreachable.

fib_detect_death() logic is reversed -- we consider a nexthop 'dead' only
if it has a neigh entry in unreachable state.

Only then we search fib_aliases for an alternative and use one of these in
a round-robin fashion.  If all are believed to be unreachable, no change is
made and fib-chosen nh_gw is used.

Reported-by: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 Changes since v1:
  Address comments from Alex Duyck:
  - use if (fib_nud_is_unreach( .. rather than temporary boolean retval
  - rename last_* varibles to fi_, they're not the last item in the list...
  - kill pointless if() statement, if order < 0, then fi_last is < 0 too

 net/ipv4/fib_semantics.c | 80 ++++++++++++++++++++++--------------------------
 1 file changed, 36 insertions(+), 44 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c7358ea..2cdf8d7 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -410,28 +410,24 @@ errout:
 		rtnl_set_sk_err(info->nl_net, RTNLGRP_IPV4_ROUTE, err);
 }
 
-static int fib_detect_death(struct fib_info *fi, int order,
-			    struct fib_info **last_resort, int *last_idx,
-			    int dflt)
+static bool fib_nud_is_unreach(const struct fib_info *fi)
 {
 	struct neighbour *n;
 	int state = NUD_NONE;
 
-	n = neigh_lookup(&arp_tbl, &fi->fib_nh[0].nh_gw, fi->fib_dev);
-	if (n) {
+	local_bh_disable();
+
+	n = __neigh_lookup_noref(&arp_tbl, &fi->fib_nh[0].nh_gw, fi->fib_dev);
+	if (n)
 		state = n->nud_state;
-		neigh_release(n);
-	}
-	if (state == NUD_REACHABLE)
-		return 0;
-	if ((state & NUD_VALID) && order != dflt)
-		return 0;
-	if ((state & NUD_VALID) ||
-	    (*last_idx < 0 && order > dflt)) {
-		*last_resort = fi;
-		*last_idx = order;
-	}
-	return 1;
+
+	local_bh_enable();
+
+	/* Caller might be able to find alternate (reachable) nexthop */
+	if (state & (NUD_INCOMPLETE | NUD_FAILED))
+		return true;
+
+	return false;
 }
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
@@ -1204,12 +1200,16 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event)
 /* Must be invoked inside of an RCU protected region.  */
 void fib_select_default(struct fib_result *res)
 {
-	struct fib_info *fi = NULL, *last_resort = NULL;
 	struct hlist_head *fa_head = res->fa_head;
+	struct fib_info *fi = NULL;
 	struct fib_table *tb = res->table;
-	int order = -1, last_idx = -1;
+	int order = -1, fi_idx = -1;
 	struct fib_alias *fa;
 
+	if (likely(!fib_nud_is_unreach(res->fi)))
+		return;
+
+	/* attempt to pick another nexthop */
 	hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
 		struct fib_info *next_fi = fa->fa_info;
 
@@ -1223,38 +1223,30 @@ void fib_select_default(struct fib_result *res)
 		    next_fi->fib_nh[0].nh_scope != RT_SCOPE_LINK)
 			continue;
 
+		order++;
+
+		if (next_fi == res->fi) /* already tested, not reachable */
+			continue;
+
 		fib_alias_accessed(fa);
 
-		if (!fi) {
-			if (next_fi != res->fi)
+		if (fib_nud_is_unreach(next_fi))
+			continue;
+
+		/* try to round-robin among all fa_aliases in case
+		 * res->fi nexthop is unreachable.
+		 */
+		if (fi == NULL || order > tb->tb_default) {
+			fi = next_fi;
+			fi_idx = order;
+			if (order > tb->tb_default)
 				break;
-		} else if (!fib_detect_death(fi, order, &last_resort,
-					     &last_idx, tb->tb_default)) {
-			fib_result_assign(res, fi);
-			tb->tb_default = order;
-			goto out;
 		}
-		fi = next_fi;
-		order++;
 	}
 
-	if (order <= 0 || !fi) {
-		tb->tb_default = -1;
-		goto out;
-	}
-
-	if (!fib_detect_death(fi, order, &last_resort, &last_idx,
-				tb->tb_default)) {
+	if (fi)
 		fib_result_assign(res, fi);
-		tb->tb_default = order;
-		goto out;
-	}
-
-	if (last_idx >= 0)
-		fib_result_assign(res, last_resort);
-	tb->tb_default = last_idx;
-out:
-	return;
+	tb->tb_default = fi_idx;
 }
 
 /*
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 -next] net: fib: use fib result when zero-length prefix aliases exist
  2015-07-17 17:37 [PATCH v2 -next] net: fib: use fib result when zero-length prefix aliases exist Florian Westphal
@ 2015-07-17 22:47 ` Julian Anastasov
  2015-07-18 18:05   ` Florian Westphal
  0 siblings, 1 reply; 8+ messages in thread
From: Julian Anastasov @ 2015-07-17 22:47 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, Alexander Duyck


	Hello,

On Fri, 17 Jul 2015, Florian Westphal wrote:

> default route selection is not deterministic when TOS keys are used:

	In fact, TOS should be matched just like in
fib_table_lookup but it is not.

> This changes fib_select_default to not change the FIB chosen result EXCEPT
> if this nexthop appears to be unreachable.
> 
> fib_detect_death() logic is reversed -- we consider a nexthop 'dead' only
> if it has a neigh entry in unreachable state.

	The only difference I see is that NUD_NODE is
declared by fib_nud_is_unreach() as reachable. May be
it is better, for example, new route in NUD_NONE state
will be tried for a while, until its state is determined.

> +	if (likely(!fib_nud_is_unreach(res->fi)))
> +		return;

	here first is unreachable...

> +
> +	/* attempt to pick another nexthop */
>  	hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
>  		struct fib_info *next_fi = fa->fa_info;
>  
> @@ -1223,38 +1223,30 @@ void fib_select_default(struct fib_result *res)
>  		    next_fi->fib_nh[0].nh_scope != RT_SCOPE_LINK)
>  			continue;
>  
> +		order++;
> +
> +		if (next_fi == res->fi) /* already tested, not reachable */
> +			continue;
> +
>  		fib_alias_accessed(fa);
>  
> -		if (!fi) {
> -			if (next_fi != res->fi)
> +		if (fib_nud_is_unreach(next_fi))

			all others are unreachable too...

> +			continue;
> +
> +		/* try to round-robin among all fa_aliases in case
> +		 * res->fi nexthop is unreachable.

	round-robin only among reachables? But original algorithm
can round-robin among unreachables. State will not change
without trying some packets to unreachable dests.

> +		 */
> +		if (fi == NULL || order > tb->tb_default) {
> +			fi = next_fi;
> +			fi_idx = order;
> +			if (order > tb->tb_default)
>  				break;

	1=unreac, 2=reach, 3=tb_default(reach), 4=reach.
We like 2 (fi == NULL), why we continue after 2, skip
default 3 (because order > tb->tb_default is false) and
finally select 4 (order=4 > tb->tb_default=3)?

> +	tb->tb_default = fi_idx;

	And we do not round-robin if all look unreachable?

	I'm not yet sure if fib_detect_death logic should
be changed. What is strange is that we can switch between
two NUD_VALID entries...

	The __neigh_lookup_noref usage looks ok.
Not sure about the NUD_NODE change, may be it is ok to
be added in fib_detect_death. As for fib_select_default,
may be only change like below is needed. It additionally
restricts the alternative routes to same prefix, same TOS.
The TOS restriction was missing from long time but the
prefix check was lost recently when fa_head was changed
to contain aliases from different prefixes.

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 49c142b..0b1af68 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -290,7 +290,7 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb);
 int fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 			u8 tos, int oif, struct net_device *dev,
 			struct in_device *idev, u32 *itag);
-void fib_select_default(struct fib_result *res);
+void fib_select_default(const struct flowi4 *flp, struct fib_result *res);
 #ifdef CONFIG_IP_ROUTE_CLASSID
 static inline int fib_num_tclassid_users(struct net *net)
 {
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c7358ea..088b2a5 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1202,21 +1202,29 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event)
 }
 
 /* Must be invoked inside of an RCU protected region.  */
-void fib_select_default(struct fib_result *res)
+void fib_select_default(const struct flowi4 *flp, struct fib_result *res)
 {
 	struct fib_info *fi = NULL, *last_resort = NULL;
 	struct hlist_head *fa_head = res->fa_head;
 	struct fib_table *tb = res->table;
+	u8 slen = 32 - res->prefixlen;
 	int order = -1, last_idx = -1;
 	struct fib_alias *fa;
 
 	hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
 		struct fib_info *next_fi = fa->fa_info;
 
+		if (fa->fa_slen != slen)
+			continue;
 		if (next_fi->fib_scope != res->scope ||
 		    fa->fa_type != RTN_UNICAST)
 			continue;
 
+		if (fa->tb_id != tb->tb_id)
+			continue;
+		if (fa->fa_tos && fa->fa_tos != flp->flowi4_tos)
+			continue;
+
 		if (next_fi->fib_priority > res->fi->fib_priority)
 			break;
 		if (!next_fi->fib_nh[0].nh_gw ||
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index d0362a2..e681b85 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2176,7 +2176,7 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 	if (!res.prefixlen &&
 	    res.table->tb_num_default > 1 &&
 	    res.type == RTN_UNICAST && !fl4->flowi4_oif)
-		fib_select_default(&res);
+		fib_select_default(fl4, &res);
 
 	if (!fl4->saddr)
 		fl4->saddr = FIB_RES_PREFSRC(net, res);

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 -next] net: fib: use fib result when zero-length prefix aliases exist
  2015-07-17 22:47 ` Julian Anastasov
@ 2015-07-18 18:05   ` Florian Westphal
  2015-07-19 11:43     ` Julian Anastasov
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Westphal @ 2015-07-18 18:05 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: netdev, Alexander Duyck

Julian Anastasov <ja@ssi.bg> wrote:

[ Dave, please toss my patch, its either v3 or something else entirely ]

> 	In fact, TOS should be matched just like in
> fib_table_lookup but it is not.
> 
> > This changes fib_select_default to not change the FIB chosen result EXCEPT
> > if this nexthop appears to be unreachable.
> > 
> > fib_detect_death() logic is reversed -- we consider a nexthop 'dead' only
> > if it has a neigh entry in unreachable state.
> 
> 	The only difference I see is that NUD_NODE is
> declared by fib_nud_is_unreach() as reachable. May be
> it is better, for example, new route in NUD_NONE state
> will be tried for a while, until its state is determined.

fib_detect_death considers neigh_lookup() returning NULL as unreach.

I don't think its good idea and should rather only skip if NUD is
failed or incomplete.

> > +	if (likely(!fib_nud_is_unreach(res->fi)))
> > +		return;
> 
> 	here first is unreachable...
> > +	/* attempt to pick another nexthop */
> >  	hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
> >  		struct fib_info *next_fi = fa->fa_info;
> >  
> > @@ -1223,38 +1223,30 @@ void fib_select_default(struct fib_result *res)
> >  		    next_fi->fib_nh[0].nh_scope != RT_SCOPE_LINK)
> >  			continue;
> >  
> > +		order++;
> > +
> > +		if (next_fi == res->fi) /* already tested, not reachable */
> > +			continue;
> > +
> >  		fib_alias_accessed(fa);
> >  
> > -		if (!fi) {
> > -			if (next_fi != res->fi)
> > +		if (fib_nud_is_unreach(next_fi))
> 
> 			all others are unreachable too...
> 
> > +			continue;
> > +
> > +		/* try to round-robin among all fa_aliases in case
> > +		 * res->fi nexthop is unreachable.
> 
> 	round-robin only among reachables? But original algorithm
> can round-robin among unreachables. State will not change
> without trying some packets to unreachable dests.

Hmm... thanks for pointing that out.

But thats confusing.  If thats true, why does fib_detect_death
even exist?  If we skip unreachables, they cannot become
reachable again...?

> > +			fi = next_fi;
> > +			fi_idx = order;
> > +			if (order > tb->tb_default)
> >  				break;
> 
> 	1=unreac, 2=reach, 3=tb_default(reach), 4=reach.
> We like 2 (fi == NULL), why we continue after 2, skip
> default 3 (because order > tb->tb_default is false) and
> finally select 4 (order=4 > tb->tb_default=3)?

Hmm.  Let me ask different question.
What is the supposed behaviour in the case you describe?

Removing order > default
test means we'll always pick first reachable in the list, no?

I don't think thats bad, but, how/why is is that 'better'?

> > +	tb->tb_default = fi_idx;
> 
> 	And we do not round-robin if all look unreachable?

Right.  One possible solution might be this patch, but alter
fib_nud_is_unreach() to also check timer_pending(n->timer), i.e.

if (state & (NUD_INCOMPLETE | NUD_FAILED))
	return timer_was_pending ? UNREACHABLE : REACHABLE;

This way we would re-try gw not only if neigh doesn't exist but
also if no retry is pending.

What do you think?

> As for fib_select_default,
> may be only change like below is needed. It additionally
> restricts the alternative routes to same prefix, same TOS.
> The TOS restriction was missing from long time but the
> prefix check was lost recently when fa_head was changed
> to contain aliases from different prefixes.

>  /* Must be invoked inside of an RCU protected region.  */
> -void fib_select_default(struct fib_result *res)
> +void fib_select_default(const struct flowi4 *flp, struct fib_result *res)
>  {
>  	struct fib_info *fi = NULL, *last_resort = NULL;
>  	struct hlist_head *fa_head = res->fa_head;
>  	struct fib_table *tb = res->table;
> +	u8 slen = 32 - res->prefixlen;
>  	int order = -1, last_idx = -1;
>  	struct fib_alias *fa;
>  
>  	hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
>  		struct fib_info *next_fi = fa->fa_info;
>  
> +		if (fa->fa_slen != slen)
> +			continue;
>  		if (next_fi->fib_scope != res->scope ||
>  		    fa->fa_type != RTN_UNICAST)
>  			continue;
>  
> +		if (fa->tb_id != tb->tb_id)
> +			continue;
> +		if (fa->fa_tos && fa->fa_tos != flp->flowi4_tos)
> +			continue;

Hmm, this means we won't consider alternative if tos doesn't match,
except 0.

So with
ip route add tos 0x00 via 192.168.1.1
ip route add tos 0x04 via 192.168.1.2
ip route add tos 0x08 via 192.168.1.3
ip route add tos 0x0c via 192.168.1.4
ip route add tos 0x10 via 192.168.1.5

and ping -Q 1.2.3.4

if 1.5 is unreachable, we pick 1.1 (tos 0x0).

Others are not attempted.
Do you consider that 'correct'?  I don't see why
we can use tos 0x00 but not e.g. 0x04 in that case.

If .5 becomes available, we won't attempt it and still use 1.1/tos 0.

Thanks for looking into this.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 -next] net: fib: use fib result when zero-length prefix aliases exist
  2015-07-18 18:05   ` Florian Westphal
@ 2015-07-19 11:43     ` Julian Anastasov
  2015-07-19 23:03       ` Florian Westphal
  0 siblings, 1 reply; 8+ messages in thread
From: Julian Anastasov @ 2015-07-19 11:43 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, Alexander Duyck


	Hello,

On Sat, 18 Jul 2015, Florian Westphal wrote:

> Julian Anastasov <ja@ssi.bg> wrote:
> 
> > 	The only difference I see is that NUD_NODE is
> > declared by fib_nud_is_unreach() as reachable. May be
> > it is better, for example, new route in NUD_NONE state
> > will be tried for a while, until its state is determined.
> 
> fib_detect_death considers neigh_lookup() returning NULL as unreach.

	I mean, NUD_NODE is possible state, for example,
if entry is created by user space for silly reasons.
For example:

ip neigh add $IP dev $DEV nud none
ip neigh list nud none

	It is present and not used yet. Even ip route get
can not trigger neigh resolving, state will remain same.
Only traffic can trigger resolving.

> I don't think its good idea and should rather only skip if NUD is
> failed or incomplete.

	OK, then we are back to NUD_VALID usage.

> > 	round-robin only among reachables? But original algorithm
> > can round-robin among unreachables. State will not change
> > without trying some packets to unreachable dests.
> 
> Hmm... thanks for pointing that out.
> 
> But thats confusing.  If thats true, why does fib_detect_death
> even exist?  If we skip unreachables, they cannot become
> reachable again...?

	No, we return them one by one. If we direct traffic
to some alternative GW its state may switch to reachable. If
this GW is unreachable, the next try will select next alternative
GW. The algorithm tries to probe the unreachable GWs until
the traffic causes successful ARP resolving and selecting
one GW as reachable.

> > 	1=unreac, 2=reach, 3=tb_default(reach), 4=reach.
> > We like 2 (fi == NULL), why we continue after 2, skip
> > default 3 (because order > tb->tb_default is false) and
> > finally select 4 (order=4 > tb->tb_default=3)?
> 
> Hmm.  Let me ask different question.
> What is the supposed behaviour in the case you describe?

	I think, GW 2 should be selected as first reachable,
it should not be skipped.

	I decode fib_detect_death and fib_select_default as follows:

1. only one fi? keep tb_default=-1 no matter the state

2. prefer the first NUD_REACHABLE or NUD_VALID that is not tb_default.
I see the following cases:

2.1. NUD_REACHABLE is before another NUD_VALID => use only the 
NUD_REACHABLE

2.2. NUD_VALID is before another NUD_VALID/NUD_REACHABLE => switch
between them. I don't know the reason for this logic, though.
Note that NUD_REACHABLE is part of the NUD_VALID mask.

3. if there are no other NUD_VALID and the current (tb_default) is
NUD_VALID use it again

4. try all (!NUD_VALID) GWs one by one in a round-robin fashion by
increasing tb_default from -1 to max and then back to -1.
Such round-robin is attempted if there are no NUD_VALID GWs.
This is used after idle periods or if new GWs are added to
direct traffic for ARP probing.

> Removing order > default
> test means we'll always pick first reachable in the list, no?

	First reachable is ok, why not? But fib_detect_death
tries some differentiation between NUD_REACHABLE and
NUD_VALID.

> I don't think thats bad, but, how/why is is that 'better'?
> 
> > > +	tb->tb_default = fi_idx;
> > 
> > 	And we do not round-robin if all look unreachable?
> 
> Right.  One possible solution might be this patch, but alter
> fib_nud_is_unreach() to also check timer_pending(n->timer), i.e.
> 
> if (state & (NUD_INCOMPLETE | NUD_FAILED))
> 	return timer_was_pending ? UNREACHABLE : REACHABLE;

	Such timer_was_pending check is not needed. State
specifies if timer is running.

> This way we would re-try gw not only if neigh doesn't exist but
> also if no retry is pending.

	You mean:

- NUD_FAILED/NUD_NONE: no timer is pending, no current probing,
we can direct traffic that will set NUD_INCOMPLETE and will
send ARP probe

- NUD_INCOMPLETE: GW is currently probed, do not use it for
round-robin, direct traffic to another NUD_FAILED/NUD_NONE GW

	Still, above states should be considered "unreachable",
the difference will be only for the round-robin algorithm
when all GWs are unreachable.

> What do you think?

	But what do you think about above fib_detect_death
logic? I think, your tests fail just because ip route get
does not trigger traffic that will cause ARP resolution and
because fib_select_default is buggy to not check slen and
especially the tos.

> > +		if (fa->tb_id != tb->tb_id)
> > +			continue;
> > +		if (fa->fa_tos && fa->fa_tos != flp->flowi4_tos)
> > +			continue;
> 
> Hmm, this means we won't consider alternative if tos doesn't match,
> except 0.
> 
> So with
> ip route add tos 0x00 via 192.168.1.1
> ip route add tos 0x04 via 192.168.1.2
> ip route add tos 0x08 via 192.168.1.3
> ip route add tos 0x0c via 192.168.1.4
> ip route add tos 0x10 via 192.168.1.5
> 
> and ping -Q 1.2.3.4
> 
> if 1.5 is unreachable, we pick 1.1 (tos 0x0).
> 
> Others are not attempted.

	Yep

> Do you consider that 'correct'?  I don't see why
> we can use tos 0x00 but not e.g. 0x04 in that case.

	It is illegal to return alternative routes that should
be selected for another TOS, 192.168.1.2 should be selected
only for packets with TOS 0x04. And TOS 0x00 in packets can not
be matched by definition, it is used as 'match any TOS' value.

	Also, without TOS restriction the 'if (next_fi != res->fi)'
check will avoid alternatives for routes that have TOS with lower
value, for example:

ip route add tos 0x20 via 1.1.1.1
ip route add tos 0x20 via 2.2.2.2
ip route add tos 0x10 via 3.3.3.3
ip route add tos 0x10 via 4.4.4.4
ip route add tos 0x00 via 5.5.5.5

	fib_select_default from recent kernels always starts
from fa_head pointing to 1.1.1.1, so if res->fi points to the
same place the alternatives for TOS 0x20 will be considered.

	But if res->fi selects 3.3.3.3 for TOS 0x10 the
'if (next_fi != res->fi)' breaks and the 4.4.4.4 alternative
is ignored => tb_default=-1.

	In your example in first email, TOS 0x10 does
round-robin because it is the highest TOS and its alt
routes are considered. Due to the bug in fib_select_default
for lower TOS values no alt GWs are considered. That explains
the behaviour you see.

> If .5 becomes available, we won't attempt it and still use 1.1/tos 0.

	1.5 is before 1.1 in fa_head list, so it has priority
among the reachable/valid routes.

	The list with alternative routes should be:

- all routes with same TOS starting from the res->fi. The
TOS check in my patch will skip routes with higher TOS value
and then with lower TOS value that is not 0.

- if above TOS is not 0 we should consider all routes with TOS 0.
In above example, we have 3 possible lists of alt routes:

- IP.TOS 0x20: 1.1.1.1, 2.2.2.2, 5.5.5.5
- IP.TOS 0x10: 3.3.3.3, 4.4.4.4, 5.5.5.5
- other TOS: 5.5.5.5

	The funny part is that tb->tb_default is one
and can't remember the order for every list. May be
tb_default should be moved to fa (the first one for TOS).
In above example, the first fa is 1.1.1.1, 3.3.3.3
and 5.5.5.5.

	So, I think, there is no big reason to change
fib_detect_death. But fib_select_default needs to filter
by slen and TOS and may be to move tb_default to fa,
in case one wants to use TOS in alt routes.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 -next] net: fib: use fib result when zero-length prefix aliases exist
  2015-07-19 11:43     ` Julian Anastasov
@ 2015-07-19 23:03       ` Florian Westphal
  2015-07-20  7:49         ` Julian Anastasov
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Westphal @ 2015-07-19 23:03 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Florian Westphal, netdev, Alexander Duyck

Julian Anastasov <ja@ssi.bg> wrote:
> ip neigh add $IP dev $DEV nud none
> ip neigh list nud none
> 
> 	It is present and not used yet. Even ip route get
> can not trigger neigh resolving, state will remain same.
> Only traffic can trigger resolving.

Right.

> > > 	round-robin only among reachables? But original algorithm
> > > can round-robin among unreachables. State will not change
> > > without trying some packets to unreachable dests.
> > 
> > Hmm... thanks for pointing that out.
> > 
> > But thats confusing.  If thats true, why does fib_detect_death
> > even exist?  If we skip unreachables, they cannot become
> > reachable again...?
> 
> 	No, we return them one by one. If we direct traffic
> to some alternative GW its state may switch to reachable. If
> this GW is unreachable, the next try will select next alternative
> GW. The algorithm tries to probe the unreachable GWs until
> the traffic causes successful ARP resolving and selecting
> one GW as reachable.

Doesn't work for me (or I misunderstand). See below.

> 	I decode fib_detect_death and fib_select_default as follows:

[Description of algorithm]

> 2. prefer the first NUD_REACHABLE or NUD_VALID that is not tb_default.
> I see the following cases:
> 
> 2.1. NUD_REACHABLE is before another NUD_VALID => use only the 
> NUD_REACHABLE

Yes, and thats what I don't understand.

> 	You mean:
> 
> - NUD_FAILED/NUD_NONE: no timer is pending, no current probing,
> we can direct traffic that will set NUD_INCOMPLETE and will
> send ARP probe
> 
> - NUD_INCOMPLETE: GW is currently probed, do not use it for
> round-robin, direct traffic to another NUD_FAILED/NUD_NONE GW

Yes, thats what I meant.

> 	Still, above states should be considered "unreachable",
> the difference will be only for the round-robin algorithm
> when all GWs are unreachable.

> 	But what do you think about above fib_detect_death
> logic? I think, your tests fail just because ip route get
> does not trigger traffic that will cause ARP resolution and
> because fib_select_default is buggy to not check slen and
> especially the tos.

Mostly, yes.

> > So with
> > ip route add tos 0x00 via 192.168.1.1
> > ip route add tos 0x04 via 192.168.1.2
> > ip route add tos 0x08 via 192.168.1.3
> > ip route add tos 0x0c via 192.168.1.4
> > ip route add tos 0x10 via 192.168.1.5
> > 
> > and ping -Q 1.2.3.4
> > 
> > if 1.5 is unreachable, we pick 1.1 (tos 0x0).
> > 
> > Others are not attempted.
> 
> 	Yep
> 
> > Do you consider that 'correct'?  I don't see why
> > we can use tos 0x00 but not e.g. 0x04 in that case.
> 
> 	It is illegal to return alternative routes that should
> be selected for another TOS, 192.168.1.2 should be selected
> only for packets with TOS 0x04. And TOS 0x00 in packets can not
> be matched by definition, it is used as 'match any TOS' value.

Ok.

> 	Also, without TOS restriction the 'if (next_fi != res->fi)'
> check will avoid alternatives for routes that have TOS with lower
> value, for example:
> 
> ip route add tos 0x20 via 1.1.1.1
> ip route add tos 0x20 via 2.2.2.2
> ip route add tos 0x10 via 3.3.3.3
> ip route add tos 0x10 via 4.4.4.4
> ip route add tos 0x00 via 5.5.5.5
> 
> 	fib_select_default from recent kernels always starts
> from fa_head pointing to 1.1.1.1, so if res->fi points to the
> same place the alternatives for TOS 0x20 will be considered.
> 
> 	But if res->fi selects 3.3.3.3 for TOS 0x10 the
> 'if (next_fi != res->fi)' breaks and the 4.4.4.4 alternative
> is ignored => tb_default=-1.
> 
> 	In your example in first email, TOS 0x10 does
> round-robin because it is the highest TOS and its alt
> routes are considered. Due to the bug in fib_select_default
> for lower TOS values no alt GWs are considered. That explains
> the behaviour you see.

Yes.

> > If .5 becomes available, we won't attempt it and still use 1.1/tos 0.
> 
> 	1.5 is before 1.1 in fa_head list, so it has priority
> among the reachable/valid routes.
> 
> 	The list with alternative routes should be:
> 
> - all routes with same TOS starting from the res->fi. The
> TOS check in my patch will skip routes with higher TOS value
> and then with lower TOS value that is not 0.
> 
> - if above TOS is not 0 we should consider all routes with TOS 0.
> In above example, we have 3 possible lists of alt routes:
> 
> - IP.TOS 0x20: 1.1.1.1, 2.2.2.2, 5.5.5.5
> - IP.TOS 0x10: 3.3.3.3, 4.4.4.4, 5.5.5.5
> - other TOS: 5.5.5.5
> 
> 	The funny part is that tb->tb_default is one
> and can't remember the order for every list. May be
> tb_default should be moved to fa (the first one for TOS).
> In above example, the first fa is 1.1.1.1, 3.3.3.3
> and 5.5.5.5.
> 
> 	So, I think, there is no big reason to change
> fib_detect_death. But fib_select_default needs to filter
> by slen and TOS and may be to move tb_default to fa,
> in case one wants to use TOS in alt routes.

Let me give you an example why I still think that fib_detect_death
is not optimal.

Test VM is running net-next with your proposed patch on top.

Its much better, the weird round-robin behaviour reported is gone.

The VM has two interfaces,
eth0, 192.168.7.10
eth1, 192.168.8.10

ip route del default
ip route add tos 0x0 via 192.168.7.1
ip route add tos 0x10 via 192.168.8.2

7.1 is reachable via eth0 (7.10/24)
8.2 *should* be rechable via eth1 (8.10/24)

I say *should* because I deliberately deleted this address from gateway
connected to that interface.

Now, I run

ping -Q 0x10 192.168.0.7

Packets get sent via tos 0x0.  So far, so good.

Now I add back 192.168.8.2.

But no probe takes place so packets continue to be sent via tos 0 route.

neigh_lookup returns NULL -> state is NUD_NONE -> gw is deemed unreachable

If I ping in the other direction from 192.168.8.2 packets are steered to
the 8.2/tos 0x10 gateway.

ip neigh flush nud reachable
on the test vm also makes packets take the tos 0x10 gateway.

The reverse (.8.2 available, packets sent via tos 0x10 use it, 8.2
address is removed) works: after a few seconds entry moves to STALE,
then FAILED, packets get sent via tos 0x0 gateway.

In any case, I don't want to hold this up, your proposed patch fixes
the TOS-0x10-always-round-robins issue.

Thanks for your detailed analysis & help with this.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 -next] net: fib: use fib result when zero-length prefix aliases exist
  2015-07-19 23:03       ` Florian Westphal
@ 2015-07-20  7:49         ` Julian Anastasov
  2015-07-20 11:15           ` Florian Westphal
  0 siblings, 1 reply; 8+ messages in thread
From: Julian Anastasov @ 2015-07-20  7:49 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, Alexander Duyck


	Hello,

On Mon, 20 Jul 2015, Florian Westphal wrote:

> > 	You mean:
> > 
> > - NUD_FAILED/NUD_NONE: no timer is pending, no current probing,
> > we can direct traffic that will set NUD_INCOMPLETE and will
> > send ARP probe
> > 
> > - NUD_INCOMPLETE: GW is currently probed, do not use it for
> > round-robin, direct traffic to another NUD_FAILED/NUD_NONE GW
> 
> Yes, thats what I meant.

	Then, may be something like diff below?

> Let me give you an example why I still think that fib_detect_death
> is not optimal.
> 
> Test VM is running net-next with your proposed patch on top.
> 
> Its much better, the weird round-robin behaviour reported is gone.
> 
> The VM has two interfaces,
> eth0, 192.168.7.10
> eth1, 192.168.8.10
> 
> ip route del default
> ip route add tos 0x0 via 192.168.7.1
> ip route add tos 0x10 via 192.168.8.2
> 
> 7.1 is reachable via eth0 (7.10/24)
> 8.2 *should* be rechable via eth1 (8.10/24)
> 
> I say *should* because I deliberately deleted this address from gateway
> connected to that interface.
> 
> Now, I run
> 
> ping -Q 0x10 192.168.0.7
> 
> Packets get sent via tos 0x0.  So far, so good.
> 
> Now I add back 192.168.8.2.
> 
> But no probe takes place so packets continue to be sent via tos 0 route.
> 
> neigh_lookup returns NULL -> state is NUD_NONE -> gw is deemed unreachable

	So, this is a case when configuration changes,
for example, simple change in list of routes is not noticed?
The best option would be to keep genid value together with
tb_default but if we move this field to fa it becomes costly
and not worth it.

	May be we can just play with the state as already
discussed? Example separate diff (to be ported for net-next).

ipv4: be more aggressive when probing alternative gateways

Currently, we do not notice if new alternative gateways
are added. We can do it by checking for present neigh
entry. Also, gateways that are currently probed (NUD_INCOMPLETE)
can be skipped from round-robin probing.

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c7358ea..cd1732a 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -421,13 +421,15 @@ static int fib_detect_death(struct fib_info *fi, int order,
 	if (n) {
 		state = n->nud_state;
 		neigh_release(n);
+	} else {
+		return 0;
 	}
 	if (state == NUD_REACHABLE)
 		return 0;
 	if ((state & NUD_VALID) && order != dflt)
 		return 0;
 	if ((state & NUD_VALID) ||
-	    (*last_idx < 0 && order > dflt)) {
+	    (*last_idx < 0 && order > dflt && state != NUD_INCOMPLETE)) {
 		*last_resort = fi;
 		*last_idx = order;
 	}

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 -next] net: fib: use fib result when zero-length prefix aliases exist
  2015-07-20  7:49         ` Julian Anastasov
@ 2015-07-20 11:15           ` Florian Westphal
  2015-07-20 21:25             ` Julian Anastasov
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Westphal @ 2015-07-20 11:15 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Florian Westphal, netdev, Alexander Duyck

Julian Anastasov <ja@ssi.bg> wrote:
> On Mon, 20 Jul 2015, Florian Westphal wrote:
> > The VM has two interfaces,
> > eth0, 192.168.7.10
> > eth1, 192.168.8.10
> > 
> > ip route del default
> > ip route add tos 0x0 via 192.168.7.1
> > ip route add tos 0x10 via 192.168.8.2
> > 
> > 7.1 is reachable via eth0 (7.10/24)
> > 8.2 *should* be rechable via eth1 (8.10/24)
> > 
> > I say *should* because I deliberately deleted this address from gateway
> > connected to that interface.
> > 
> > Now, I run
> > 
> > ping -Q 0x10 192.168.0.7
> > 
> > Packets get sent via tos 0x0.  So far, so good.
> > 
> > Now I add back 192.168.8.2.
> > 
> > But no probe takes place so packets continue to be sent via tos 0 route.
> > 
> > neigh_lookup returns NULL -> state is NUD_NONE -> gw is deemed unreachable
> 
> 	So, this is a case when configuration changes,
> for example, simple change in list of routes is not noticed?

There are no changes on the test system, the address 192.168.8.2 is
added back to the gateway only.

> 	May be we can just play with the state as already
> discussed? Example separate diff (to be ported for net-next).
> 
> ipv4: be more aggressive when probing alternative gateways
> 
> Currently, we do not notice if new alternative gateways
> are added. We can do it by checking for present neigh
> entry. Also, gateways that are currently probed (NUD_INCOMPLETE)
> can be skipped from round-robin probing.
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index c7358ea..cd1732a 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -421,13 +421,15 @@ static int fib_detect_death(struct fib_info *fi, int order,
>  	if (n) {
>  		state = n->nud_state;
>  		neigh_release(n);
> +	} else {
> +		return 0;
>  	}
>  	if (state == NUD_REACHABLE)
>  		return 0;
>  	if ((state & NUD_VALID) && order != dflt)
>  		return 0;
>  	if ((state & NUD_VALID) ||
> -	    (*last_idx < 0 && order > dflt)) {
> +	    (*last_idx < 0 && order > dflt && state != NUD_INCOMPLETE)) {
>  		*last_resort = fi;
>  		*last_idx = order;
>  	}

I think its a good idea.

Could you please submit your two patches (tos fix and this one)
formally?

Thanks!

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 -next] net: fib: use fib result when zero-length prefix aliases exist
  2015-07-20 11:15           ` Florian Westphal
@ 2015-07-20 21:25             ` Julian Anastasov
  0 siblings, 0 replies; 8+ messages in thread
From: Julian Anastasov @ 2015-07-20 21:25 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, Alexander Duyck


	Hello,

On Mon, 20 Jul 2015, Florian Westphal wrote:

> Could you please submit your two patches (tos fix and this one)
> formally?

	In a day or two, I have to extend the patch and
do some tests...

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2015-07-20 21:25 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-07-17 17:37 [PATCH v2 -next] net: fib: use fib result when zero-length prefix aliases exist Florian Westphal
2015-07-17 22:47 ` Julian Anastasov
2015-07-18 18:05   ` Florian Westphal
2015-07-19 11:43     ` Julian Anastasov
2015-07-19 23:03       ` Florian Westphal
2015-07-20  7:49         ` Julian Anastasov
2015-07-20 11:15           ` Florian Westphal
2015-07-20 21:25             ` Julian Anastasov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).