From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Ahern Subject: Re: [PATCH net] net: ipv4: Multipath needs to handle unreachable nexthops Date: Thu, 24 Mar 2016 20:05:07 -0600 Message-ID: <56F49CD3.1060406@cumulusnetworks.com> References: <1458833154-39091-1-git-send-email-dsa@cumulusnetworks.com> Mime-Version: 1.0 Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Cc: netdev@vger.kernel.org To: Julian Anastasov Return-path: Received: from mail-ig0-f170.google.com ([209.85.213.170]:33502 "EHLO mail-ig0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751588AbcCYCFM (ORCPT ); Thu, 24 Mar 2016 22:05:12 -0400 Received: by mail-ig0-f170.google.com with SMTP id ig19so5864903igb.0 for ; Thu, 24 Mar 2016 19:05:11 -0700 (PDT) In-Reply-To: Sender: netdev-owner@vger.kernel.org List-ID: On 3/24/16 4:33 PM, Julian Anastasov wrote: > But for multipath routes we can also consider the > nexthops as "alternatives", so it depends on how one uses > the multipath mechanism. The ability to fallback to > another nexthop assumes one connection is allowed to > move from one ISP to another. What if the second ISP > decides to reject the connection? What we have is a > broken connection just because the retransmits > were diverted to wrong place in the hurry. So, the > nexthops can be compatible or incompatible. For your > setup they are, for others they are not. I am not sure I completely understand your point. Are you saying that within a single multipath route some connections to nexthops are allowed and others are not? So to put that paragraph into an example 15.0.0.0/16 nexthop via 12.0.0.2 dev swp1 weight 1 nexthop via 12.0.0.3 dev swp1 weight 1 Hosts from 15.0/16 could have TCP connections use 12.0.0.2, but not 12.0.0.3 because 12.0.0.3 could be a different ISP and not allow TCP connections from this address space? > > For multipath setups I recall for the following > possible cases: > > 1. Every packet from connection hits multipath route > and what we want is the connection to use only one > nexthop. If nexthop fails then the connection fails. > Latest changes for the multipath algorithm chose this > option: use hash to map traffic to nexthop and any > fallbacks to another nexthop are not allowed because > we should follow the hash mapping. Works when nexthops > use different ISPs. > > 2. Only the first packet hits multipath route. With > the help from CONNMARK all next packets from connection > use the same nexthop by using another unicast route. > The goal here is the multipath route to be used just to > balance connections. This works best when the nexthop > selection was random and not a hash based. This was > how the previous algorithm worked. Fallbacks are > desired because it is fine to select alive nexthop > for the first packet, but wrong for the next packets. > > My preference was for the random selection, > I don't know why we restricted the new algorithm. May be > because in the common case when just a single default > multipath route is created we want to use all nexthops > in a ideal world where all nexthops are alive. > > So, if the kernel used a random selection > your fallback algorithm should help. But it is fragile > for the simple setup with single default multipath route. > May be what we miss is the ability to choose between > random and hash-based selection. Then your patch may be > useful but only for setup 2 (multipath route hit only by > first packet). So, your patch may come with a sysctl var > that explains your current patch logic: "avoid failed nexthops, > never probe them, wait their failed entry to be expired by > neigh_periodic_work and just then we can use the nexthop > by creating new entry to probe the GW". Who will trigger > probes often enough to maintain fresh state? First packet out does the probe -- neigh lookup fails, kernel has no knowledge so can't reject the nexthop based on neighbor information. After that if it has information that says that a nexthop is dead, why would it continue to try to probe? Any traffic that selects that nh is dead. That to me defies the basis of having multiple paths. > > Also, one may argue that such decisions should > be done in user space. It is common the direct routers > to answer ARP even while their uplink is down. In > the common case, one may need to ping 2-3 indirect > gateways to decide if a path is alive and then to recreate > the default multipath route. > > More comments below... > >> Signed-off-by: David Ahern >> --- >> net/ipv4/fib_semantics.c | 34 ++++++++++++++++++++++++++++++++-- >> 1 file changed, 32 insertions(+), 2 deletions(-) >> >> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c >> index d97268e8ff10..28fc6700c2b1 100644 >> --- a/net/ipv4/fib_semantics.c >> +++ b/net/ipv4/fib_semantics.c >> @@ -1563,13 +1563,43 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags) >> void fib_select_multipath(struct fib_result *res, int hash) >> { >> struct fib_info *fi = res->fi; >> + struct neighbour *n; >> + int state; >> >> for_nexthops(fi) { >> if (hash > atomic_read(&nh->nh_upper_bound)) >> continue; >> >> - res->nh_sel = nhsel; >> - return; >> + state = NUD_NONE; >> + n = neigh_lookup(&arp_tbl, &nh->nh_gw, fi->fib_dev); > > Sometimes nh_gw is a local IP, you can call > neigh_lookup (or a lockless RCU-BH variant) only for the > nh_scope == RT_SCOPE_LINK case, just like in fib_select_default. got it, will change. > >> + if (n) { >> + state = n->nud_state; >> + neigh_release(n); >> + } >> + if (!n || (state == NUD_REACHABLE) || (state & NUD_VALID)) { > > NUD_REACHABLE is part of NUD_VALID, so this is shorter: > > if (!n || (state & NUD_VALID)) { sure. > >> + res->nh_sel = nhsel; >> + return; >> + } >> + } endfor_nexthops(fi); >> + >> + /* try the nexthops again, but covering the entries >> + * skipped by the hash >> + */ >> + fi = res->fi; >> + for_nexthops(fi) { >> + if (hash <= atomic_read(&nh->nh_upper_bound)) > > This is dangerous, we can try a RTNH_F_DEAD entry > with nh_upper_bound = -1. This can work with 2 checks: In v2 of my patch I dropped this change as it technically is not needed for the case I am trying to solve. > > if (hash <= atomic_read(&nh->nh_upper_bound)) > break; > if (atomic_read(&nh->nh_upper_bound) < 0) > continue; > >> + continue; >> + >> + state = NUD_NONE; >> + n = neigh_lookup(&arp_tbl, &nh->nh_gw, fi->fib_dev); >> + if (n) { >> + state = n->nud_state; >> + neigh_release(n); >> + } >> + if (!n || (state == NUD_REACHABLE) || (state & NUD_VALID)) { >> + res->nh_sel = nhsel; >> + return; >> + } >> } endfor_nexthops(fi); > > If all are failed why not use the nexthop that was > selected by the hash? Even if it is failed, new ARP probe > can succeed. sure. Thanks for the comments. David