From mboxrd@z Thu Jan  1 00:00:00 1970
From: David Ahern <dsa@cumulusnetworks.com>
Subject: Re: [PATCH net] net: ipv4: Multipath needs to handle unreachable
 nexthops
Date: Thu, 24 Mar 2016 20:05:07 -0600
Message-ID: <56F49CD3.1060406@cumulusnetworks.com>
References: <1458833154-39091-1-git-send-email-dsa@cumulusnetworks.com>
 <alpine.LFD.2.11.1603242201200.1795@ja.home.ssi.bg>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Cc: netdev@vger.kernel.org
To: Julian Anastasov <ja@ssi.bg>
Return-path: <netdev-owner@vger.kernel.org>
Received: from mail-ig0-f170.google.com ([209.85.213.170]:33502 "EHLO
	mail-ig0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751588AbcCYCFM (ORCPT
	<rfc822;netdev@vger.kernel.org>); Thu, 24 Mar 2016 22:05:12 -0400
Received: by mail-ig0-f170.google.com with SMTP id ig19so5864903igb.0
        for <netdev@vger.kernel.org>; Thu, 24 Mar 2016 19:05:11 -0700 (PDT)
In-Reply-To: <alpine.LFD.2.11.1603242201200.1795@ja.home.ssi.bg>
Sender: netdev-owner@vger.kernel.org
List-ID: <netdev.vger.kernel.org>

On 3/24/16 4:33 PM, Julian Anastasov wrote:
> 	But for multipath routes we can also consider the
> nexthops as "alternatives", so it depends on how one uses
> the multipath mechanism. The ability to fallback to
> another nexthop assumes one connection is allowed to
> move from one ISP to another. What if the second ISP
> decides to reject the connection? What we have is a
> broken connection just because the retransmits
> were diverted to wrong place in the hurry. So, the
> nexthops can be compatible or incompatible. For your
> setup they are, for others they are not.

I am not sure I completely understand your point. Are you saying that 
within a single multipath route some connections to nexthops are allowed 
and others are not?

So to put that paragraph into an example

15.0.0.0/16
	nexthop via 12.0.0.2  dev swp1 weight 1
	nexthop via 12.0.0.3  dev swp1 weight 1

Hosts from 15.0/16 could have TCP connections use 12.0.0.2, but not 
12.0.0.3 because 12.0.0.3 could be a different ISP and not allow TCP 
connections from this address space?


>
> 	For multipath setups I recall for the following
> possible cases:
>
> 1. Every packet from connection hits multipath route
> and what we want is the connection to use only one
> nexthop. If nexthop fails then the connection fails.
> Latest changes for the multipath algorithm chose this
> option: use hash to map traffic to nexthop and any
> fallbacks to another nexthop are not allowed because
> we should follow the hash mapping. Works when nexthops
> use different ISPs.
>
> 2. Only the first packet hits multipath route. With
> the help from CONNMARK all next packets from connection
> use the same nexthop by using another unicast route.
> The goal here is the multipath route to be used just to
> balance connections. This works best when the nexthop
> selection was random and not a hash based. This was
> how the previous algorithm worked. Fallbacks are
> desired because it is fine to select alive nexthop
> for the first packet, but wrong for the next packets.
>
> 	My preference was for the random selection,
> I don't know why we restricted the new algorithm. May be
> because in the common case when just a single default
> multipath route is created we want to use all nexthops
> in a ideal world where all nexthops are alive.
>
> 	So, if the kernel used a random selection
> your fallback algorithm should help. But it is fragile
> for the simple setup with single default multipath route.
> May be what we miss is the ability to choose between
> random and hash-based selection. Then your patch may be
> useful but only for setup 2 (multipath route hit only by
> first packet). So, your patch may come with a sysctl var
> that explains your current patch logic: "avoid failed nexthops,
> never probe them, wait their failed entry to be expired by
> neigh_periodic_work and just then we can use the nexthop
> by creating new entry to probe the GW". Who will trigger
> probes often enough to maintain fresh state?

First packet out does the probe -- neigh lookup fails, kernel has no 
knowledge so can't reject the nexthop based on neighbor information.

After that if it has information that says that a nexthop is dead, why 
would it continue to try to probe? Any traffic that selects that nh is 
dead. That to me defies the basis of having multiple paths.

>
> 	Also, one may argue that such decisions should
> be done in user space. It is common the direct routers
> to answer ARP even while their uplink is down. In
> the common case, one may need to ping 2-3 indirect
> gateways to decide if a path is alive and then to recreate
> the default multipath route.
>
> 	More comments below...
>
>> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
>> ---
>>   net/ipv4/fib_semantics.c | 34 ++++++++++++++++++++++++++++++++--
>>   1 file changed, 32 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
>> index d97268e8ff10..28fc6700c2b1 100644
>> --- a/net/ipv4/fib_semantics.c
>> +++ b/net/ipv4/fib_semantics.c
>> @@ -1563,13 +1563,43 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
>>   void fib_select_multipath(struct fib_result *res, int hash)
>>   {
>>   	struct fib_info *fi = res->fi;
>> +	struct neighbour *n;
>> +	int state;
>>
>>   	for_nexthops(fi) {
>>   		if (hash > atomic_read(&nh->nh_upper_bound))
>>   			continue;
>>
>> -		res->nh_sel = nhsel;
>> -		return;
>> +		state = NUD_NONE;
>> +		n = neigh_lookup(&arp_tbl, &nh->nh_gw, fi->fib_dev);
>
> 	Sometimes nh_gw is a local IP, you can call
> neigh_lookup (or a lockless RCU-BH variant) only for the
> nh_scope == RT_SCOPE_LINK case, just like in fib_select_default.

got it, will change.

>
>> +		if (n) {
>> +			state = n->nud_state;
>> +			neigh_release(n);
>> +		}
>> +		if (!n || (state == NUD_REACHABLE) || (state & NUD_VALID)) {
>
> 	NUD_REACHABLE is part of NUD_VALID, so this is shorter:
>
> 	if (!n || (state & NUD_VALID)) {

sure.


>
>> +			res->nh_sel = nhsel;
>> +			return;
>> +		}
>> +	} endfor_nexthops(fi);
>> +
>> +	/* try the nexthops again, but covering the entries
>> +	 * skipped by the hash
>> +	 */
>> +	fi = res->fi;
>> +	for_nexthops(fi) {
>> +		if (hash <= atomic_read(&nh->nh_upper_bound))
>
> 	This is dangerous, we can try a RTNH_F_DEAD entry
> with nh_upper_bound = -1. This can work with 2 checks:

In v2 of my patch I dropped this change as it technically is not needed 
for the case I am trying to solve.

>
> 	if (hash <= atomic_read(&nh->nh_upper_bound))
> 		break;
> 	if (atomic_read(&nh->nh_upper_bound) < 0)
> 		continue;
>
>> +			continue;
>> +
>> +		state = NUD_NONE;
>> +		n = neigh_lookup(&arp_tbl, &nh->nh_gw, fi->fib_dev);
>> +		if (n) {
>> +			state = n->nud_state;
>> +			neigh_release(n);
>> +		}
>> +		if (!n || (state == NUD_REACHABLE) || (state & NUD_VALID)) {
>> +			res->nh_sel = nhsel;
>> +			return;
>> +		}
>>   	} endfor_nexthops(fi);
>
> 	If all are failed why not use the nexthop that was
> selected by the hash? Even if it is failed, new ARP probe
> can succeed.

sure.

Thanks for the comments.

David