From: David Ahern <dsa@cumulusnetworks.com>
To: Julian Anastasov <ja@ssi.bg>
Cc: netdev@vger.kernel.org
Subject: Re: [PATCH net] net: ipv4: Multipath needs to handle unreachable nexthops
Date: Thu, 24 Mar 2016 20:05:07 -0600 [thread overview]
Message-ID: <56F49CD3.1060406@cumulusnetworks.com> (raw)
In-Reply-To: <alpine.LFD.2.11.1603242201200.1795@ja.home.ssi.bg>
On 3/24/16 4:33 PM, Julian Anastasov wrote:
> But for multipath routes we can also consider the
> nexthops as "alternatives", so it depends on how one uses
> the multipath mechanism. The ability to fallback to
> another nexthop assumes one connection is allowed to
> move from one ISP to another. What if the second ISP
> decides to reject the connection? What we have is a
> broken connection just because the retransmits
> were diverted to wrong place in the hurry. So, the
> nexthops can be compatible or incompatible. For your
> setup they are, for others they are not.
I am not sure I completely understand your point. Are you saying that
within a single multipath route some connections to nexthops are allowed
and others are not?
So to put that paragraph into an example
15.0.0.0/16
nexthop via 12.0.0.2 dev swp1 weight 1
nexthop via 12.0.0.3 dev swp1 weight 1
Hosts from 15.0/16 could have TCP connections use 12.0.0.2, but not
12.0.0.3 because 12.0.0.3 could be a different ISP and not allow TCP
connections from this address space?
>
> For multipath setups I recall for the following
> possible cases:
>
> 1. Every packet from connection hits multipath route
> and what we want is the connection to use only one
> nexthop. If nexthop fails then the connection fails.
> Latest changes for the multipath algorithm chose this
> option: use hash to map traffic to nexthop and any
> fallbacks to another nexthop are not allowed because
> we should follow the hash mapping. Works when nexthops
> use different ISPs.
>
> 2. Only the first packet hits multipath route. With
> the help from CONNMARK all next packets from connection
> use the same nexthop by using another unicast route.
> The goal here is the multipath route to be used just to
> balance connections. This works best when the nexthop
> selection was random and not a hash based. This was
> how the previous algorithm worked. Fallbacks are
> desired because it is fine to select alive nexthop
> for the first packet, but wrong for the next packets.
>
> My preference was for the random selection,
> I don't know why we restricted the new algorithm. May be
> because in the common case when just a single default
> multipath route is created we want to use all nexthops
> in a ideal world where all nexthops are alive.
>
> So, if the kernel used a random selection
> your fallback algorithm should help. But it is fragile
> for the simple setup with single default multipath route.
> May be what we miss is the ability to choose between
> random and hash-based selection. Then your patch may be
> useful but only for setup 2 (multipath route hit only by
> first packet). So, your patch may come with a sysctl var
> that explains your current patch logic: "avoid failed nexthops,
> never probe them, wait their failed entry to be expired by
> neigh_periodic_work and just then we can use the nexthop
> by creating new entry to probe the GW". Who will trigger
> probes often enough to maintain fresh state?
First packet out does the probe -- neigh lookup fails, kernel has no
knowledge so can't reject the nexthop based on neighbor information.
After that if it has information that says that a nexthop is dead, why
would it continue to try to probe? Any traffic that selects that nh is
dead. That to me defies the basis of having multiple paths.
>
> Also, one may argue that such decisions should
> be done in user space. It is common the direct routers
> to answer ARP even while their uplink is down. In
> the common case, one may need to ping 2-3 indirect
> gateways to decide if a path is alive and then to recreate
> the default multipath route.
>
> More comments below...
>
>> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
>> ---
>> net/ipv4/fib_semantics.c | 34 ++++++++++++++++++++++++++++++++--
>> 1 file changed, 32 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
>> index d97268e8ff10..28fc6700c2b1 100644
>> --- a/net/ipv4/fib_semantics.c
>> +++ b/net/ipv4/fib_semantics.c
>> @@ -1563,13 +1563,43 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
>> void fib_select_multipath(struct fib_result *res, int hash)
>> {
>> struct fib_info *fi = res->fi;
>> + struct neighbour *n;
>> + int state;
>>
>> for_nexthops(fi) {
>> if (hash > atomic_read(&nh->nh_upper_bound))
>> continue;
>>
>> - res->nh_sel = nhsel;
>> - return;
>> + state = NUD_NONE;
>> + n = neigh_lookup(&arp_tbl, &nh->nh_gw, fi->fib_dev);
>
> Sometimes nh_gw is a local IP, you can call
> neigh_lookup (or a lockless RCU-BH variant) only for the
> nh_scope == RT_SCOPE_LINK case, just like in fib_select_default.
got it, will change.
>
>> + if (n) {
>> + state = n->nud_state;
>> + neigh_release(n);
>> + }
>> + if (!n || (state == NUD_REACHABLE) || (state & NUD_VALID)) {
>
> NUD_REACHABLE is part of NUD_VALID, so this is shorter:
>
> if (!n || (state & NUD_VALID)) {
sure.
>
>> + res->nh_sel = nhsel;
>> + return;
>> + }
>> + } endfor_nexthops(fi);
>> +
>> + /* try the nexthops again, but covering the entries
>> + * skipped by the hash
>> + */
>> + fi = res->fi;
>> + for_nexthops(fi) {
>> + if (hash <= atomic_read(&nh->nh_upper_bound))
>
> This is dangerous, we can try a RTNH_F_DEAD entry
> with nh_upper_bound = -1. This can work with 2 checks:
In v2 of my patch I dropped this change as it technically is not needed
for the case I am trying to solve.
>
> if (hash <= atomic_read(&nh->nh_upper_bound))
> break;
> if (atomic_read(&nh->nh_upper_bound) < 0)
> continue;
>
>> + continue;
>> +
>> + state = NUD_NONE;
>> + n = neigh_lookup(&arp_tbl, &nh->nh_gw, fi->fib_dev);
>> + if (n) {
>> + state = n->nud_state;
>> + neigh_release(n);
>> + }
>> + if (!n || (state == NUD_REACHABLE) || (state & NUD_VALID)) {
>> + res->nh_sel = nhsel;
>> + return;
>> + }
>> } endfor_nexthops(fi);
>
> If all are failed why not use the nexthop that was
> selected by the hash? Even if it is failed, new ARP probe
> can succeed.
sure.
Thanks for the comments.
David
next prev parent reply other threads:[~2016-03-25 2:05 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-03-24 15:25 [PATCH net] net: ipv4: Multipath needs to handle unreachable nexthops David Ahern
2016-03-24 16:45 ` Eric Dumazet
2016-03-24 17:43 ` David Ahern
2016-03-24 17:55 ` Eric Dumazet
2016-03-24 18:26 ` David Miller
2016-03-24 22:33 ` Julian Anastasov
2016-03-25 2:05 ` David Ahern [this message]
2016-03-25 9:05 ` Julian Anastasov
2016-03-30 3:16 ` David Ahern
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=56F49CD3.1060406@cumulusnetworks.com \
--to=dsa@cumulusnetworks.com \
--cc=ja@ssi.bg \
--cc=netdev@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.