From mboxrd@z Thu Jan  1 00:00:00 1970
From: Manish Kathuria
Date: Mon, 30 Jan 2006 03:50:27 +0000
Subject: Re: [LARTC] Problems in Dead Gateway Detection / Failover -
Message-Id: <43DD8A33.9020305@tuxspace.com>
List-Id:
References: <43D8CEAE.3010006@tuxspace.com>
In-Reply-To: <43D8CEAE.3010006@tuxspace.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: lartc@vger.kernel.org

gypsy wrote:
> Manish Kathuria wrote:
> --= snip =--
>
>> However, if there is a problem in the ISP connectivity at any of the
>> subsequent hops, there is no dead gateway detection and failover also
>> does not take place. I have tested this on various linux kernels from
>> 2.4 as well as 2.6 series.
>>
>> Somehow I have never faced a similar problem before and things have been
>> working perfectly. In real life situation here, the first hop gateway is
>> rarely going to be down so dead gateway detection and failover is going
>> to be required whenever there is some connectivity problem at any of the
>> later hops. So that's where dead gateway detection needs to work.
>>
>> What could be the reason ? How can this be resolved ? I would appreciate
>> any pointers or suggestions.
>>
>> Thanks,
>>
>> Manish Kathuria
>
>
> Manish,
>
> Same here (a long time ago.  I no longer have multiple ISPs).
>
> I don't have any answers for you, but here are a few pointers:

Thanks for your mail. I will try out the suggestions given by you.

>
> Use arping in a script, pinging the farthest hop that arping can reach
> that is of interest.  Whenever arping returns a bad status, run 'ip
> route flush cache'.  Put a nice long sleep in the script and run it all
> the time.
>
> Perhaps in that same script, 'ping -n1 -I' each WAN interface in turn to
> some destination that must always be up but reachable only by/on that
> interface.  Run 'ip route flush cache' whenever that ping fails.
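A minimal sketch of the check suggested in the quoted text above, written
in Python rather than shell so the logic is easy to test. The interface
names, the probe targets (documentation addresses) and the 30 second sleep
are assumptions, not values from this thread. It uses the iputils spelling
'ping -c 1 -I <interface>' for a single probe bound to one interface.

```python
import subprocess
import time

# Hypothetical WAN interfaces and one probe target per link.
LINKS = {"eth1": "192.0.2.1", "eth2": "198.51.100.1"}


def ping_command(interface, target):
    """Build an iputils ping call: one probe, 2 s timeout, bound to one interface."""
    return ["ping", "-c", "1", "-W", "2", "-I", interface, target]


def link_is_up(interface, target):
    """Return True if a single ping sent through `interface` gets a reply."""
    result = subprocess.run(ping_command(interface, target),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def flush_route_cache():
    """Run 'ip route flush cache', as suggested in the quoted text."""
    subprocess.run(["ip", "route", "flush", "cache"], check=False)


def check_links(links, probe=link_is_up, on_failure=flush_route_cache):
    """Probe every link once; call on_failure once if any probe failed.

    Returns the list of interfaces whose probe failed.
    """
    failed = [iface for iface, target in links.items()
              if not probe(iface, target)]
    if failed:
        on_failure()
    return failed


def run_forever(interval=30):
    """Repeat the check, sleeping between rounds so the links are not flooded."""
    while True:
        check_links(LINKS)
        time.sleep(interval)
```

Calling run_forever() as root would start the loop. The probe and the
failure action are passed in as parameters only so that the loop body can
be exercised without sending real packets.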

The only question is whether, by doing this, the kernel will be able to
mark the gateway with the bad status as down. If it does not need any
other intervention, then it's really superb.

>
> You are just trying to detect the up or down status of the link, so
> don't flood the connection with arping and ping packets.  Using sleep,
> space those pings apart to something sensible.

I was thinking of writing a daemon which will ping a remote host through
each of the WAN interfaces every 5 seconds. If one of them gives a bad
status response continuously for 8-10 times, the default route will be
changed to the other ISP's gateway, and if the status changes again, it
will be restored to the load-balanced multipath state. I will have to
actually try both and see which method fits better here and is more
elegant. If your suggestion works, it's perhaps the best way out.

>
> Although Julian has never confirmed (or denied) this, it was my
> experience that only the **__FIRST__** nexhop affected the up or down
> status of the connection.  If that succeeded, nothing would flag the
> connection as dead.  If you know C, perhaps you can examine Julian's
> kernel patch to see if there is any useful information there.  In my
> opinion, Julian should document exactly how DGD works.  Perhaps he has
> and I just can't find it on his web site, but (when I cared), I was not
> able to find anything useful there.

There are excellent documents at http://www.ssi.bg/~ja/dgd-usage.txt and
http://www.ssi.bg/~ja/nano.txt which explain it very well. Quoting from
the dgd-usage.txt document here ...

---Begin Quote---

* the alternative routes check the neighbour state not only for gateways
  but for hosts, i.e. for any kind of neighbours. Note that in some cases
  the neighbour can remain in reachable state while its nexthops are
  failed. For example, it is even possible the gateway to be a proxy ARP
  server and the gateway IP to remain always in reachable state.
  In such case we can not notice the real state of the gateway's IP.

* the alternative routes can be a list from unipath or multipath routes,
  using NOARP and ARP devices. As result, the first alive or first
  suspected (but not dead) route is selected by inspecting the state of
  the gateways in each path or the neighbours through the used device
  from the path.

* as result we take care of the state of each path in a multipath route
  and we try to use only the alive paths considering their relative
  weights

---End Quote---

In the current situation I am dealing with, the first-hop gateway is
always reachable. It is only the subsequent hops which can go down. When
that happens, dead gateway detection doesn't work and the outgoing
traffic keeps going out through the dead ISP's WAN interface. What
confuses me is that DGD does work for one of the ISPs, which is connected
identically. Could running routed / gated play a role in resolving this
problem ?

>
> Have you tried to engage Julian in a conversation to resolve this?  He
> posts here occasionally but I do not know if he answers questions about
> DGD off this list.

I have not done it so far.

> --
> gypsy
>

Thanks once again for your suggestions.

--
Manish Kathuria
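
The daemon idea described earlier in this message (probe every 5 seconds,
fail over after 8-10 consecutive bad responses, restore the multipath
route when the link answers again) could be sketched roughly as below.
This is only an illustration of the counting logic and the 'ip route'
calls it would issue. The gateway addresses and interface names in any
usage are hypothetical, and restoring on the first good probe is an
assumption, since the message does not say how many good responses
should be required.

```python
from dataclasses import dataclass

# Consecutive failed probes before failing over (lower end of "8-10 times").
FAIL_THRESHOLD = 8


@dataclass
class LinkState:
    """Track consecutive probe failures for one WAN link."""
    threshold: int = FAIL_THRESHOLD
    failures: int = 0
    down: bool = False

    def record(self, ok):
        """Feed one probe result; return "failover", "restore", or None."""
        if ok:
            self.failures = 0
            if self.down:
                # Assumption: a single good probe is enough to restore.
                self.down = False
                return "restore"
            return None
        self.failures += 1
        if not self.down and self.failures >= self.threshold:
            self.down = True
            return "failover"
        return None


def failover_command(gateway, interface):
    """'ip route' call that points the default route at the surviving ISP."""
    return ["ip", "route", "replace", "default",
            "via", gateway, "dev", interface]


def multipath_command(paths):
    """'ip route' call that restores the load-balanced multipath default.

    `paths` is a list of (gateway, interface, weight) tuples.
    """
    cmd = ["ip", "route", "replace", "default", "scope", "global"]
    for gateway, interface, weight in paths:
        cmd += ["nexthop", "via", gateway, "dev", interface,
                "weight", str(weight)]
    return cmd
```

A daemon built on this would keep one LinkState per WAN interface, feed it
a ping result every 5 seconds, and pass the returned command list to
subprocess.run() when record() reports "failover" or "restore". Whether
this is needed at all depends on whether the route cache flush alone lets
the kernel mark the failing path as dead.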