From mboxrd@z Thu Jan  1 00:00:00 1970
From: Manish Kathuria <manish@tuxspace.com>
Date: Mon, 30 Jan 2006 03:50:27 +0000
Subject: Re: [LARTC] Problems in Dead Gateway Detection / Failover -
Message-Id: <43DD8A33.9020305@tuxspace.com>
List-Id: <lartc.vger.kernel.org>
References: <43D8CEAE.3010006@tuxspace.com>
In-Reply-To: <43D8CEAE.3010006@tuxspace.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: lartc@vger.kernel.org

gypsy wrote:
> Manish Kathuria wrote:
> --= snip =--
> 
>>  However, if there is a problem in the ISP connectivity at any of the
>>subsequent hops, there is no dead gateway detection and failover also
>>does not take place. I have tested this on various linux kernels from
>>2.4 as well as 2.6 series.
>>
>>Somehow I have never faced a similar problem before and things have been
>>working perfectly. In real life situation here, the first hop gateway is
>>rarely going to be down so dead gateway detection and failover is going
>>to be required whenever there is some connectivity problem at any of the
>>later hops. So that's where dead gateway detection needs to work.
>>
>>What could be the reason ? How can this be resolved ? I would appreciate
>>any pointers or suggestions.
>>
>>Thanks,
>>
>>Manish Kathuria
> 
> 
> Manish,
> 
> Same here (a long time ago.  I no longer have multiple ISPs).
> 
> I don't have any answers for you, but here are a few pointers:

Thanks for your mail. I will try out the suggestions you gave.

> 
> Use arping in a script, pinging the farthest hop that arping can reach
> that is of interest.  Whenever arping returns a bad status, run 'ip
> route flush cache'.  Put a nice long sleep in the script and run it all
> the time.
> 
> Perhaps in that same script, 'ping -n1 -I' each WAN interface in turn to
> some destination that must always be up but reachable only by/on that
> interface.  Run 'ip route flush cache' whenever that ping fails.

My only question is whether doing this lets the kernel mark the gateway 
with the bad status as down. If it needs no other intervention, then 
it's really superb.
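
To make the suggestion concrete for myself, here is a minimal, untested
sketch in Python of one pass of such a check. The helper names
(run_quiet, probe_commands, check_once) and the (device, gateway,
target) tuples are placeholders of mine, not anything from your script.
I have used 'ping -c 1', since that is how iputils ping asks for a
single probe, and I am assuming the iputils arping as well.

```python
import subprocess


def run_quiet(cmd):
    """Run a command, discard its output, and return its exit status."""
    return subprocess.run(cmd, stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode


def probe_commands(dev, gateway, target):
    """Commands that probe one WAN link: ARP the gateway, then ping a
    far host that should only be reachable through this interface."""
    return [
        ["arping", "-c", "1", "-I", dev, gateway],
        ["ping", "-c", "1", "-W", "2", "-I", dev, target],
    ]


def check_once(links, run=run_quiet):
    """Probe every link once and flush the route cache if any probe
    failed. Returns the list of devices whose probes failed."""
    failed = []
    for dev, gateway, target in links:
        # any() stops at the first failing command for this link.
        if any(run(cmd) != 0 for cmd in probe_commands(dev, gateway, target)):
            failed.append(dev)
    if failed:
        run(["ip", "route", "flush", "cache"])
    return failed
```

Calling check_once() from a loop with a long sleep between passes would
give the behaviour you describe. The run argument is only there so the
logic can be tried out without touching the network.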

> 
> You are just trying to detect the up or down status of the link, so
> don't flood the connection with arping and ping packets.  Using sleep,
> space those pings apart to something sensible.

I was thinking of writing a daemon which pings a remote host through 
each of the WAN interfaces every 5 seconds. If one of them returns a bad 
status 8-10 times in a row, the default route will be changed to the 
other ISP's gateway, and once the link recovers, the route will be 
restored to the load-balanced multipath state.
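
Roughly, the logic I have in mind looks like the sketch below. It is
untested and only an outline under my own assumptions: the Link class,
the threshold of 8, and the function names are all made up for
illustration, and each probe only leaves through the intended ISP if
the routing rules send traffic for that target out of that interface.

```python
import subprocess
from dataclasses import dataclass

FAIL_THRESHOLD = 8  # consecutive failed probes before a link counts as dead


@dataclass
class Link:
    name: str          # label for the ISP (hypothetical)
    dev: str           # WAN interface (hypothetical)
    gateway: str       # first-hop gateway of that ISP
    weight: int = 1    # multipath weight
    failures: int = 0  # consecutive failed probes so far

    @property
    def alive(self):
        return self.failures < FAIL_THRESHOLD


def probe(link, target):
    """Send one ping to target through the link's interface."""
    cmd = ["ping", "-c", "1", "-W", "2", "-I", link.dev, target]
    return subprocess.run(cmd, stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode == 0


def record_probe(link, ok):
    """Reset the failure counter on success, otherwise increment it."""
    link.failures = 0 if ok else link.failures + 1


def default_route_args(links):
    """Arguments for 'ip route replace ...' covering the live links,
    or None when every link is considered dead."""
    up = [l for l in links if l.alive]
    if not up:
        return None
    if len(up) == 1:
        return ["default", "via", up[0].gateway, "dev", up[0].dev]
    args = ["default"]
    for l in up:
        args += ["nexthop", "via", l.gateway, "dev", l.dev,
                 "weight", str(l.weight)]
    return args
```

The daemon itself would call probe() and record_probe() for each link
every 5 seconds and run 'ip route replace' with the result of
default_route_args() whenever that result changes. As written, a single
good reply brings a link back at once; whether recovery should also
need several good replies in a row is something I have not decided.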

I will have to actually try both and see which method fits better here 
and is more elegant. If your suggestion works, it's perhaps the best 
way out.

> 
> Although Julian has never confirmed (or denied) this, it was my
> experience that only the **__FIRST__** nexthop affected the up or down
> status of the connection.  If that succeeded, nothing would flag the
> connection as dead.  If you know C, perhaps you can examine Julian's
> kernel patch to see if there is any useful information there.  In my
> opinion, Julian should document exactly how DGD works.  Perhaps he has
> and I just can't find it on his web site, but (when I cared), I was not
> able to find anything useful there.

There are excellent documents at http://www.ssi.bg/~ja/dgd-usage.txt and 
http://www.ssi.bg/~ja/nano.txt which have explained it very well. 
Quoting from the dgd-usage.txt document here ...


---Begin Quote---

* the alternative routes check the neighbour state not only for gateways
but  for hosts, i.e. for any kind of neighbours. Note that in some cases
the  neighbour  can remain  in reachable  state  while its  nexthops are
failed.   For example, it is even possible the gateway to be a proxy ARP
server  and the gateway IP to remain  always in reachable state. In such
case we can not notice the real state of the gateway's IP.

* the alternative routes can be a list from unipath or multipath routes,
using  NOARP  and  ARP devices.  As  result,  the first  alive  or first
suspected  (but not dead)  route is selected by  inspecting the state of
the gateways in each path or the neighbours through the used device from
the path.

* as  result we take care of the state of each path in a multipath route
and  we  try to  use  only the  alive  paths considering  their relative
weights

---End Quote---

In the situation I am dealing with, the first-hop gateway is always 
reachable. Only the subsequent hops can go down, and when that happens, 
dead gateway detection does not work: the outgoing traffic keeps going 
out through the dead ISP's WAN interface. What confuses me is that DGD 
does work for one of the ISPs, which is connected identically.
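
Since the quoted text says the routes are chosen by neighbour state, one
way to see what DGD sees is to look at the state 'ip neigh show' reports
for each gateway. Below is a small sketch for that. It assumes the state
is the last word on each output line, and the grouping into alive /
suspected / dead is only my reading of the quoted text, not something
taken from the patch itself.

```python
def neighbour_state(ip_neigh_output, gateway):
    """Return the state 'ip neigh show' reports for gateway, or None
    if there is no entry for it."""
    for line in ip_neigh_output.splitlines():
        fields = line.split()
        if fields and fields[0] == gateway:
            return fields[-1]  # assumed: the state is the last field
    return None


def classify(state):
    """Collapse a neighbour state into a rough health label.
    The grouping is my own assumption, for illustration only."""
    if state is None:
        return "unknown"
    if state in ("REACHABLE", "PERMANENT", "NOARP"):
        return "alive"
    if state in ("FAILED", "INCOMPLETE"):
        return "dead"
    return "suspected"  # e.g. STALE, DELAY, PROBE
```

If the first-hop gateway shows up as alive here while the ISP is dead
further upstream, that would match what I am seeing: the neighbour
state says nothing about the later hops.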

Could running routed or gated help in resolving this problem?

> 
> Have you tried to engage Julian in a conversation to resolve this?  He
> posts here occasionally but I do not know if he answers questions about
> DGD off this list.

I have not done it so far.

> --
> gypsy
> 

Thanks once again for your suggestions.

--
Manish Kathuria