* [PATCH net] net: ipv4: Multipath needs to handle unreachable nexthops
@ 2016-03-24 15:25 David Ahern
From: David Ahern @ 2016-03-24 15:25 UTC (permalink / raw)
To: netdev
Multipath route lookups should consider knowledge about next hops and not
select a hop that is known to be failed.
Example:
       [h2]                [h3] 15.0.0.5
         |                   |
        3|                  3|
       [SP1]               [SP2]
      1     2              1     2
      |      \            /      |
      |       \          /       |
      |        \        /        |
      |         \      /         |
      |          \    /          |
      |           \  /           |
      |            \/            |
      |            /\            |
      |           /  \           |
      |          /    \          |
      |         /      \         |
      |        /        \        |
      |       /          \       |
      1      2            1      2
12.0.0.2 [TOR1] 3--------3 [TOR2] 12.0.0.3
           4                 4
            \               /
             \             /
              \           /
               \         /
                1       2
                [ TOR3 ]
                   3|
                    |
                  [h1] 12.0.0.1
host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:
root@h1:~# ip ro ls
...
12.0.0.0/24 dev swp1 proto kernel scope link src 12.0.0.1
15.0.0.0/16
        nexthop via 12.0.0.2 dev swp1 weight 1
        nexthop via 12.0.0.3 dev swp1 weight 1
...
If the link between tor3 and tor1 is down and the link between tor1
and tor2 is also down, then tor1 is effectively cut off from h1. Yet the route lookups
in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
ssh 15.0.0.5 gets the other. Connections that attempt to use the
12.0.0.2 nexthop fail since that neighbor is not reachable:
root@h1:~# ip neigh show
...
12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
12.0.0.2 dev swp1 FAILED
...
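
Even with 12.0.0.2 in FAILED state, a route lookup can still return that
nexthop. Illustrative 'ip route get' output for the setup above (the exact
output depends on the kernel/iproute2 version):

        root@h1:~# ip route get 15.0.0.5
        15.0.0.5 via 12.0.0.2 dev swp1  src 12.0.0.1
            cache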
The failed path can be avoided by considering known neighbor information
when selecting next hops. If the neighbor lookup fails we have no
knowledge about the nexthop, so give it a shot. If there is an entry
then only select the nexthop if the state is sane. This is similar to
what fib_detect_death does for some single path cases.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
net/ipv4/fib_semantics.c | 34 ++++++++++++++++++++++++++++++++--
1 file changed, 32 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index d97268e8ff10..28fc6700c2b1 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1563,13 +1563,43 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
 void fib_select_multipath(struct fib_result *res, int hash)
 {
         struct fib_info *fi = res->fi;
+        struct neighbour *n;
+        int state;

         for_nexthops(fi) {
                 if (hash > atomic_read(&nh->nh_upper_bound))
                         continue;

-                res->nh_sel = nhsel;
-                return;
+                state = NUD_NONE;
+                n = neigh_lookup(&arp_tbl, &nh->nh_gw, fi->fib_dev);
+                if (n) {
+                        state = n->nud_state;
+                        neigh_release(n);
+                }
+                if (!n || (state == NUD_REACHABLE) || (state & NUD_VALID)) {
+                        res->nh_sel = nhsel;
+                        return;
+                }
+        } endfor_nexthops(fi);
+
+        /* try the nexthops again, but covering the entries
+         * skipped by the hash
+         */
+        fi = res->fi;
+        for_nexthops(fi) {
+                if (hash <= atomic_read(&nh->nh_upper_bound))
+                        continue;
+
+                state = NUD_NONE;
+                n = neigh_lookup(&arp_tbl, &nh->nh_gw, fi->fib_dev);
+                if (n) {
+                        state = n->nud_state;
+                        neigh_release(n);
+                }
+                if (!n || (state == NUD_REACHABLE) || (state & NUD_VALID)) {
+                        res->nh_sel = nhsel;
+                        return;
+                }
         } endfor_nexthops(fi);

         /* Race condition: route has just become dead. */
--
1.9.1
* Re: [PATCH net] net: ipv4: Multipath needs to handle unreachable nexthops
From: Eric Dumazet @ 2016-03-24 16:45 UTC (permalink / raw)
To: David Ahern; +Cc: netdev

On Thu, 2016-03-24 at 08:25 -0700, David Ahern wrote:
> Multipath route lookups should consider knowledge about next hops and not
> select a hop that is known to be failed.

Does not look a net candidate to me.

> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
> ---
>  net/ipv4/fib_semantics.c | 34 ++++++++++++++++++++++++++++++++--
>  1 file changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index d97268e8ff10..28fc6700c2b1 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -1563,13 +1563,43 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
>  void fib_select_multipath(struct fib_result *res, int hash)
>  {
>          struct fib_info *fi = res->fi;
> +        struct neighbour *n;
> +        int state;
>
>          for_nexthops(fi) {
>                  if (hash > atomic_read(&nh->nh_upper_bound))
>                          continue;
>
> -                res->nh_sel = nhsel;
> -                return;
> +                state = NUD_NONE;
> +                n = neigh_lookup(&arp_tbl, &nh->nh_gw, fi->fib_dev);
> +                if (n) {
> +                        state = n->nud_state;
> +                        neigh_release(n);
> +                }

This looks like something that could use RCU to avoid expensive
refcounting ?
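
A lockless variant along the lines Eric suggests could look roughly like the
sketch below (illustrative only, not the posted patch):
__ipv4_neigh_lookup_noref() walks the neighbour hash under RCU without
touching the refcount, so only the nud_state read remains.

        rcu_read_lock_bh();
        n = __ipv4_neigh_lookup_noref(fi->fib_dev, (__force u32)nh->nh_gw);
        state = n ? n->nud_state : NUD_NONE;   /* racy read, same as before */
        rcu_read_unlock_bh();

The state check itself would be unchanged; only the
neigh_lookup()/neigh_release() refcount round trip goes away.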
* Re: [PATCH net] net: ipv4: Multipath needs to handle unreachable nexthops
From: David Ahern @ 2016-03-24 17:43 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev

On 3/24/16 10:45 AM, Eric Dumazet wrote:
> On Thu, 2016-03-24 at 08:25 -0700, David Ahern wrote:
>> Multipath route lookups should consider knowledge about next hops and not
>> select a hop that is known to be failed.
>
> Does not look a net candidate to me.

you don't consider this a bug? certainly not a feature.

>
>> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
>> ---
>>  net/ipv4/fib_semantics.c | 34 ++++++++++++++++++++++++++++++++--
>>  1 file changed, 32 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
>> index d97268e8ff10..28fc6700c2b1 100644
>> --- a/net/ipv4/fib_semantics.c
>> +++ b/net/ipv4/fib_semantics.c
>> @@ -1563,13 +1563,43 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
>>  void fib_select_multipath(struct fib_result *res, int hash)
>>  {
>>          struct fib_info *fi = res->fi;
>> +        struct neighbour *n;
>> +        int state;
>>
>>          for_nexthops(fi) {
>>                  if (hash > atomic_read(&nh->nh_upper_bound))
>>                          continue;
>>
>> -                res->nh_sel = nhsel;
>> -                return;
>> +                state = NUD_NONE;
>> +                n = neigh_lookup(&arp_tbl, &nh->nh_gw, fi->fib_dev);
>> +                if (n) {
>> +                        state = n->nud_state;
>> +                        neigh_release(n);
>> +                }
>
> This looks like something that could use RCU to avoid expensive
> refcounting ?

Yes, good point.
* Re: [PATCH net] net: ipv4: Multipath needs to handle unreachable nexthops
From: Eric Dumazet @ 2016-03-24 17:55 UTC (permalink / raw)
To: David Ahern; +Cc: netdev

On Thu, 2016-03-24 at 11:43 -0600, David Ahern wrote:
> On 3/24/16 10:45 AM, Eric Dumazet wrote:
> > On Thu, 2016-03-24 at 08:25 -0700, David Ahern wrote:
> >> Multipath route lookups should consider knowledge about next hops and not
> >> select a hop that is known to be failed.
> >
> > Does not look a net candidate to me.
>
> you don't consider this a bug? certainly not a feature.

For a bug that lasted years, certainly the fix can be properly tested
in net-next, then eventually backported to stable once validated.

Unless there is a sudden concern about known ways to crash linux hosts ;)
* Re: [PATCH net] net: ipv4: Multipath needs to handle unreachable nexthops
From: David Miller @ 2016-03-24 18:26 UTC (permalink / raw)
To: eric.dumazet; +Cc: dsa, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 24 Mar 2016 09:45:34 -0700

> On Thu, 2016-03-24 at 08:25 -0700, David Ahern wrote:
>> Multipath route lookups should consider knowledge about next hops and not
>> select a hop that is known to be failed.
>
> Does not look a net candidate to me.
>
>> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
>> ---
>>  net/ipv4/fib_semantics.c | 34 ++++++++++++++++++++++++++++++++--
>>  1 file changed, 32 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
>> index d97268e8ff10..28fc6700c2b1 100644
>> --- a/net/ipv4/fib_semantics.c
>> +++ b/net/ipv4/fib_semantics.c
>> @@ -1563,13 +1563,43 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
>>  void fib_select_multipath(struct fib_result *res, int hash)
>>  {
>>          struct fib_info *fi = res->fi;
>> +        struct neighbour *n;
>> +        int state;
>>
>>          for_nexthops(fi) {
>>                  if (hash > atomic_read(&nh->nh_upper_bound))
>>                          continue;
>>
>> -                res->nh_sel = nhsel;
>> -                return;
>> +                state = NUD_NONE;
>> +                n = neigh_lookup(&arp_tbl, &nh->nh_gw, fi->fib_dev);
>> +                if (n) {
>> +                        state = n->nud_state;
>> +                        neigh_release(n);
>> +                }
>
> This looks like something that could use RCU to avoid expensive
> refcounting ?

Indeed, this is way too expensive as-is.
* Re: [PATCH net] net: ipv4: Multipath needs to handle unreachable nexthops
From: Julian Anastasov @ 2016-03-24 22:33 UTC (permalink / raw)
To: David Ahern; +Cc: netdev

Hello,

On Thu, 24 Mar 2016, David Ahern wrote:

> Multipath route lookups should consider knowledge about next hops and not
> select a hop that is known to be failed.
...
> The failed path can be avoided by considering known neighbor information
> when selecting next hops. If the neighbor lookups fails we have no
> knowledge about the nexthop, so give it a shot. If there is an entry
> then only select the nexthop if the state is sane. This is similar to
> what fib_detect_death does for some single path cases.

fib_detect_death works with alternatives and the neighbor
status is used as the only present info in kernel.

But for multipath routes we can also consider the
nexthops as "alternatives", so it depends on how one uses
the multipath mechanism. The ability to fallback to
another nexthop assumes one connection is allowed to
move from one ISP to another. What if the second ISP
decides to reject the connection? What we have is a
broken connection just because the retransmits
were diverted to wrong place in the hurry. So, the
nexthops can be compatible or incompatible. For your
setup they are, for others they are not.

For multipath setups I recall for the following
possible cases:

1. Every packet from connection hits multipath route
and what we want is the connection to use only one
nexthop. If nexthop fails then the connection fails.
Latest changes for the multipath algorithm chose this
option: use hash to map traffic to nexthop and any
fallbacks to another nexthop are not allowed because
we should follow the hash mapping. Works when nexthops
use different ISPs.

2. Only the first packet hits multipath route. With
the help from CONNMARK all next packets from connection
use the same nexthop by using another unicast route.
The goal here is the multipath route to be used just to
balance connections. This works best when the nexthop
selection was random and not a hash based. This was
how the previous algorithm worked. Fallbacks are
desired because it is fine to select alive nexthop
for the first packet, but wrong for the next packets.

My preference was for the random selection,
I don't know why we restricted the new algorithm. May be
because in the common case when just a single default
multipath route is created we want to use all nexthops
in a ideal world where all nexthops are alive.

So, if the kernel used a random selection
your fallback algorithm should help. But it is fragile
for the simple setup with single default multipath route.
May be what we miss is the ability to choose between
random and hash-based selection. Then your patch may be
useful but only for setup 2 (multipath route hit only by
first packet). So, your patch may come with a sysctl var
that explains your current patch logic: "avoid failed nexthops,
never probe them, wait their failed entry to be expired by
neigh_periodic_work and just then we can use the nexthop
by creating new entry to probe the GW". Who will trigger
probes often enough to maintain fresh state?

Also, one may argue that such decisions should
be done in user space. It is common the direct routers
to answer ARP even while their uplink is down. In
the common case, one may need to ping 2-3 indirect
gateways to decide if a path is alive and then to recreate
the default multipath route.

More comments below...

> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
> ---
>  net/ipv4/fib_semantics.c | 34 ++++++++++++++++++++++++++++++++--
>  1 file changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index d97268e8ff10..28fc6700c2b1 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -1563,13 +1563,43 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
>  void fib_select_multipath(struct fib_result *res, int hash)
>  {
>          struct fib_info *fi = res->fi;
> +        struct neighbour *n;
> +        int state;
>
>          for_nexthops(fi) {
>                  if (hash > atomic_read(&nh->nh_upper_bound))
>                          continue;
>
> -                res->nh_sel = nhsel;
> -                return;
> +                state = NUD_NONE;
> +                n = neigh_lookup(&arp_tbl, &nh->nh_gw, fi->fib_dev);

Sometimes nh_gw is a local IP, you can call
neigh_lookup (or a lockless RCU-BH variant) only for the
nh_scope == RT_SCOPE_LINK case, just like in fib_select_default.

> +                if (n) {
> +                        state = n->nud_state;
> +                        neigh_release(n);
> +                }
> +                if (!n || (state == NUD_REACHABLE) || (state & NUD_VALID)) {

NUD_REACHABLE is part of NUD_VALID, so this is shorter:

        if (!n || (state & NUD_VALID)) {

> +                        res->nh_sel = nhsel;
> +                        return;
> +                }
> +        } endfor_nexthops(fi);
> +
> +        /* try the nexthops again, but covering the entries
> +         * skipped by the hash
> +         */
> +        fi = res->fi;
> +        for_nexthops(fi) {
> +                if (hash <= atomic_read(&nh->nh_upper_bound))

This is dangerous, we can try a RTNH_F_DEAD entry
with nh_upper_bound = -1. This can work with 2 checks:

        if (hash <= atomic_read(&nh->nh_upper_bound))
                break;
        if (atomic_read(&nh->nh_upper_bound) < 0)
                continue;

> +                        continue;
> +
> +                state = NUD_NONE;
> +                n = neigh_lookup(&arp_tbl, &nh->nh_gw, fi->fib_dev);
> +                if (n) {
> +                        state = n->nud_state;
> +                        neigh_release(n);
> +                }
> +                if (!n || (state == NUD_REACHABLE) || (state & NUD_VALID)) {
> +                        res->nh_sel = nhsel;
> +                        return;
> +                }
>          } endfor_nexthops(fi);

If all are failed why not use the nexthop that was
selected by the hash? Even if it is failed, new ARP probe
can succeed.

>          /* Race condition: route has just become dead. */
> --
> 1.9.1

Regards
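
For reference, the neighbour states Julian mentions are bit flags (values as
in include/uapi/linux/neighbour.h and include/net/neighbour.h of that era), so
NUD_VALID already includes NUD_REACHABLE:

        #define NUD_INCOMPLETE  0x01
        #define NUD_REACHABLE   0x02
        #define NUD_STALE       0x04
        #define NUD_DELAY       0x08
        #define NUD_PROBE       0x10
        #define NUD_FAILED      0x20
        #define NUD_NOARP       0x40    /* dummy state */
        #define NUD_PERMANENT   0x80    /* dummy state */
        #define NUD_NONE        0x00

        #define NUD_VALID       (NUD_PERMANENT|NUD_NOARP|NUD_REACHABLE| \
                                 NUD_PROBE|NUD_STALE|NUD_DELAY)

With that, "state & NUD_VALID" rejects exactly NUD_NONE, NUD_INCOMPLETE and
NUD_FAILED.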
* Re: [PATCH net] net: ipv4: Multipath needs to handle unreachable nexthops
From: David Ahern @ 2016-03-25 2:05 UTC (permalink / raw)
To: Julian Anastasov; +Cc: netdev

On 3/24/16 4:33 PM, Julian Anastasov wrote:
> But for multipath routes we can also consider the
> nexthops as "alternatives", so it depends on how one uses
> the multipath mechanism. The ability to fallback to
> another nexthop assumes one connection is allowed to
> move from one ISP to another. What if the second ISP
> decides to reject the connection? What we have is a
> broken connection just because the retransmits
> were diverted to wrong place in the hurry. So, the
> nexthops can be compatible or incompatible. For your
> setup they are, for others they are not.

I am not sure I completely understand your point. Are you saying that within a
single multipath route some connections to nexthops are allowed and others are
not?

So to put that paragraph into an example

15.0.0.0/16
        nexthop via 12.0.0.2 dev swp1 weight 1
        nexthop via 12.0.0.3 dev swp1 weight 1

Hosts from 15.0/16 could have TCP connections use 12.0.0.2, but not 12.0.0.3
because 12.0.0.3 could be a different ISP and not allow TCP connections from
this address space?

>
> For multipath setups I recall for the following
> possible cases:
>
> 1. Every packet from connection hits multipath route
> and what we want is the connection to use only one
> nexthop. If nexthop fails then the connection fails.
> Latest changes for the multipath algorithm chose this
> option: use hash to map traffic to nexthop and any
> fallbacks to another nexthop are not allowed because
> we should follow the hash mapping. Works when nexthops
> use different ISPs.
>
> 2. Only the first packet hits multipath route. With
> the help from CONNMARK all next packets from connection
> use the same nexthop by using another unicast route.
> The goal here is the multipath route to be used just to
> balance connections. This works best when the nexthop
> selection was random and not a hash based. This was
> how the previous algorithm worked. Fallbacks are
> desired because it is fine to select alive nexthop
> for the first packet, but wrong for the next packets.
>
> My preference was for the random selection,
> I don't know why we restricted the new algorithm. May be
> because in the common case when just a single default
> multipath route is created we want to use all nexthops
> in a ideal world where all nexthops are alive.
>
> So, if the kernel used a random selection
> your fallback algorithm should help. But it is fragile
> for the simple setup with single default multipath route.
> May be what we miss is the ability to choose between
> random and hash-based selection. Then your patch may be
> useful but only for setup 2 (multipath route hit only by
> first packet). So, your patch may come with a sysctl var
> that explains your current patch logic: "avoid failed nexthops,
> never probe them, wait their failed entry to be expired by
> neigh_periodic_work and just then we can use the nexthop
> by creating new entry to probe the GW". Who will trigger
> probes often enough to maintain fresh state?

First packet out does the probe -- neigh lookup fails, kernel has no knowledge
so can't reject the nexthop based on neighbor information.

After that if it has information that says that a nexthop is dead, why would
it continue to try to probe? Any traffic that selects that nh is dead. That to
me defies the basis of having multiple paths.

>
> Also, one may argue that such decisions should
> be done in user space. It is common the direct routers
> to answer ARP even while their uplink is down. In
> the common case, one may need to ping 2-3 indirect
> gateways to decide if a path is alive and then to recreate
> the default multipath route.
>
> More comments below...
>
>> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
>> ---
>>  net/ipv4/fib_semantics.c | 34 ++++++++++++++++++++++++++++++++--
>>  1 file changed, 32 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
>> index d97268e8ff10..28fc6700c2b1 100644
>> --- a/net/ipv4/fib_semantics.c
>> +++ b/net/ipv4/fib_semantics.c
>> @@ -1563,13 +1563,43 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
>>  void fib_select_multipath(struct fib_result *res, int hash)
>>  {
>>          struct fib_info *fi = res->fi;
>> +        struct neighbour *n;
>> +        int state;
>>
>>          for_nexthops(fi) {
>>                  if (hash > atomic_read(&nh->nh_upper_bound))
>>                          continue;
>>
>> -                res->nh_sel = nhsel;
>> -                return;
>> +                state = NUD_NONE;
>> +                n = neigh_lookup(&arp_tbl, &nh->nh_gw, fi->fib_dev);
>
> Sometimes nh_gw is a local IP, you can call
> neigh_lookup (or a lockless RCU-BH variant) only for the
> nh_scope == RT_SCOPE_LINK case, just like in fib_select_default.

got it, will change.

>
>> +                if (n) {
>> +                        state = n->nud_state;
>> +                        neigh_release(n);
>> +                }
>> +                if (!n || (state == NUD_REACHABLE) || (state & NUD_VALID)) {
>
> NUD_REACHABLE is part of NUD_VALID, so this is shorter:
>
>         if (!n || (state & NUD_VALID)) {

sure.

>
>> +                        res->nh_sel = nhsel;
>> +                        return;
>> +                }
>> +        } endfor_nexthops(fi);
>> +
>> +        /* try the nexthops again, but covering the entries
>> +         * skipped by the hash
>> +         */
>> +        fi = res->fi;
>> +        for_nexthops(fi) {
>> +                if (hash <= atomic_read(&nh->nh_upper_bound))
>
> This is dangerous, we can try a RTNH_F_DEAD entry
> with nh_upper_bound = -1. This can work with 2 checks:

In v2 of my patch I dropped this change as it technically is not needed for
the case I am trying to solve.

>
>         if (hash <= atomic_read(&nh->nh_upper_bound))
>                 break;
>         if (atomic_read(&nh->nh_upper_bound) < 0)
>                 continue;
>
>> +                        continue;
>> +
>> +                state = NUD_NONE;
>> +                n = neigh_lookup(&arp_tbl, &nh->nh_gw, fi->fib_dev);
>> +                if (n) {
>> +                        state = n->nud_state;
>> +                        neigh_release(n);
>> +                }
>> +                if (!n || (state == NUD_REACHABLE) || (state & NUD_VALID)) {
>> +                        res->nh_sel = nhsel;
>> +                        return;
>> +                }
>>          } endfor_nexthops(fi);
>
> If all are failed why not use the nexthop that was
> selected by the hash? Even if it is failed, new ARP probe
> can succeed.

sure.

Thanks for the comments.

David
* Re: [PATCH net] net: ipv4: Multipath needs to handle unreachable nexthops
From: Julian Anastasov @ 2016-03-25 9:05 UTC (permalink / raw)
To: David Ahern; +Cc: netdev

Hello,

On Thu, 24 Mar 2016, David Ahern wrote:

> On 3/24/16 4:33 PM, Julian Anastasov wrote:
> > But for multipath routes we can also consider the
> > nexthops as "alternatives", so it depends on how one uses
> > the multipath mechanism. The ability to fallback to
> > another nexthop assumes one connection is allowed to
> > move from one ISP to another. What if the second ISP
> > decides to reject the connection? What we have is a
> > broken connection just because the retransmits
> > were diverted to wrong place in the hurry. So, the
> > nexthops can be compatible or incompatible. For your
> > setup they are, for others they are not.
>
> I am not sure I completely understand your point. Are you saying that within a
> single multipath route some connections to nexthops are allowed and others are
> not?
>
> So to put that paragraph into an example
>
> 15.0.0.0/16
>         nexthop via 12.0.0.2 dev swp1 weight 1
>         nexthop via 12.0.0.3 dev swp1 weight 1
>
> Hosts from 15.0/16 could have TCP connections use 12.0.0.2, but not 12.0.0.3
> because 12.0.0.3 could be a different ISP and not allow TCP connections from
> this address space?

Yes. Two cases are possible:

1. ISP2 filters saddr, traffic with saddr from ISP1 is dropped.

2. ISP2 allows any saddr. But how the responses from
world with daddr=IP_from_ISP1 will come from ISP2 link?
If the nexthops are for different ISP the connection
can survive only if sticks to its ISP. An ISP will
not work as a backup link for another ISP.

The hash-based algorithm does not move connections from
one nexthop to another. If you rearrange the nexthops
on failure, the binding is lost and connections can break.
A fallback from fragile wifi link can upset users and any
redirects will lead to bad experience with broken conns.

So only CONNMARK can help to stick connection to some path.
If this path has multiple simultaneous alternatives you can
again use multipath route reached from 'ip rule from PUBIP2
table ISP2' but then we again are restricted from the
hash-based alg which is suitable only for default routes
hit by saddr=0.0.0.0 lookups.

In the common case when connection is created there are two
lookups:

1. TCP selects nexthop with a saddr=0.0.0.0 lookup. ISP is
selected based on the mp alg.

2. If one is lucky to reach ip_route_me_harder the hash-based
lookup is defeated because now lookup uses saddr=iph->saddr,
so it selects different nexthop. It works while all packets
for the connection reach ip_route_me_harder.

> > So, if the kernel used a random selection
> > your fallback algorithm should help. But it is fragile
> > for the simple setup with single default multipath route.
> > May be what we miss is the ability to choose between
> > random and hash-based selection. Then your patch may be
> > useful but only for setup 2 (multipath route hit only by
> > first packet). So, your patch may come with a sysctl var
> > that explains your current patch logic: "avoid failed nexthops,
> > never probe them, wait their failed entry to be expired by
> > neigh_periodic_work and just then we can use the nexthop
> > by creating new entry to probe the GW". Who will trigger
> > probes often enough to maintain fresh state?
>
> First packet out does the probe -- neigh lookup fails, kernel has no knowledge
> so can't reject the nexthop based on neighbor information.
>
> After that if it has information that says that a nexthop is dead, why would
> it continue to try to probe? Any traffic that selects that nh is dead. That to

If entry becomes FAILED this state is preserved
if we do not direct traffic to this entry. If there was a
single connection that was rejected after 3 failed probes
the next connection (with your patch) will fallback to
another neigh and the first entry will remain in FAILED
state until expiration. If one wants to refresh the state
often, a script/tool that pings all GWs is needed, so that
you can notice the available or failed paths faster.

> me defies the basis of having multiple paths.

We do not know how long is the outage. Long living
connections may prefer to survive with retransmits.
Say you are using SSH via wifi link doing important work.
Do you want your connection to break just because link was
down for a while? Fallback to other ISP can be unwanted.

If we do not know if multipath route is used per-packet or
per-connection we can not just apply a fallback to other
nexthops. We can do it only if user can configure the
different usages per multipath route:

1. hash-based or random

2. allow fallback or not.

This config is not a MUST if users can select random mode,
it can imply that fallback is allowed.

Regards
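
The CONNMARK arrangement Julian describes is typically built along these
lines. This is an illustrative sketch only; the mark values and table numbers
are made up, and OUTPUT rather than PREROUTING would be used for locally
generated traffic:

        # restore the connection's mark; the first packet of a new
        # connection has mark 0 and is assigned one of the two paths
        iptables -t mangle -A PREROUTING -j CONNMARK --restore-mark
        iptables -t mangle -A PREROUTING -m mark --mark 0 -m statistic \
                --mode random --probability 0.5 -j MARK --set-mark 1
        iptables -t mangle -A PREROUTING -m mark --mark 0 -j MARK --set-mark 2
        iptables -t mangle -A PREROUTING -j CONNMARK --save-mark

        # each mark maps to a single-nexthop routing table
        ip rule add fwmark 1 table 101
        ip rule add fwmark 2 table 102

Only the first packet of a connection hits a balancing decision; every later
packet follows the saved mark, which is the "setup 2" usage described above.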
* Re: [PATCH net] net: ipv4: Multipath needs to handle unreachable nexthops
From: David Ahern @ 2016-03-30 3:16 UTC (permalink / raw)
To: Julian Anastasov; +Cc: netdev

On 3/25/16 4:05 AM, Julian Anastasov wrote:
>
> Hello,
>
> On Thu, 24 Mar 2016, David Ahern wrote:
>
>> On 3/24/16 4:33 PM, Julian Anastasov wrote:
>>> But for multipath routes we can also consider the
>>> nexthops as "alternatives", so it depends on how one uses
>>> the multipath mechanism. The ability to fallback to
>>> another nexthop assumes one connection is allowed to
>>> move from one ISP to another. What if the second ISP
>>> decides to reject the connection? What we have is a
>>> broken connection just because the retransmits
>>> were diverted to wrong place in the hurry. So, the
>>> nexthops can be compatible or incompatible. For your
>>> setup they are, for others they are not.
>>
>> I am not sure I completely understand your point. Are you saying that within a
>> single multipath route some connections to nexthops are allowed and others are
>> not?
>>
>> So to put that paragraph into an example
>>
>> 15.0.0.0/16
>>         nexthop via 12.0.0.2 dev swp1 weight 1
>>         nexthop via 12.0.0.3 dev swp1 weight 1
>>
>> Hosts from 15.0/16 could have TCP connections use 12.0.0.2, but not 12.0.0.3
>> because 12.0.0.3 could be a different ISP and not allow TCP connections from
>> this address space?
>
> Yes. Two cases are possible:
>
> 1. ISP2 filters saddr, traffic with saddr from ISP1 is dropped.
>
> 2. ISP2 allows any saddr. But how the responses from
> world with daddr=IP_from_ISP1 will come from ISP2 link?
> If the nexthops are for different ISP the connection
> can survive only if sticks to its ISP. An ISP will
> not work as a backup link for another ISP.

Seems to me this is a problem that is addressed by VRFs, not multipath routes
where some nexthops are actually deadends because they attempt to cross ISPs.

>> After that if it has information that says that a nexthop is dead, why would
>> it continue to try to probe? Any traffic that selects that nh is dead. That to
>
> If entry becomes FAILED this state is preserved
> if we do not direct traffic to this entry. If there was a
> single connection that was rejected after 3 failed probes
> the next connection (with your patch) will fallback to
> another neigh and the first entry will remain in FAILED
> state until expiration. If one wants to refresh the state
> often, a script/tool that pings all GWs is needed, so that
> you can notice the available or failed paths faster.
>
>> me defies the basis of having multiple paths.
>
> We do not know how long is the outage. Long living
> connections may prefer to survive with retransmits.
> Say you are using SSH via wifi link doing important work.
> Do you want your connection to break just because link was
> down for a while?

neighbor entries have a timeout and when it drops from the cache the arp will
try again. This suggested patch is not saying 'never try a nexthop again' it
is saying 'I have multiple paths and since path 1 is down try another one'.

I'll send an updated patch when I get time (traveling at the moment); I guess
a sysctl is going to be needed if the behavior you mention with ISPs is
reasonable.
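
The aging David refers to is governed by the per-device neighbour sysctls;
illustrative commands, with the interface name taken from the example earlier
in the thread:

        # time after which an unused stale/failed entry becomes a
        # candidate for garbage collection
        sysctl net.ipv4.neigh.swp1.gc_stale_time
        # base value for the randomized REACHABLE lifetime
        sysctl net.ipv4.neigh.swp1.base_reachable_time_ms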