Re: [LARTC] Multihome load balancing - kernel vs netfilter

From: Peter Rabbitson <rabbit@rabbit.us>
To: lartc@vger.kernel.org
Subject: Re: [LARTC] Multihome load balancing - kernel vs netfilter
Date: Mon, 14 May 2007 07:15:56 +0000
Message-ID: <46480CAC.6050002@rabbit.us>
In-Reply-To: <4647FA30.5040401@rabbit.us>

Salim S I wrote:
>> -----Original Message-----
>> From: lartc-bounces@mailman.ds9a.nl
>> [mailto:lartc-bounces@mailman.ds9a.nl] On Behalf Of Peter Rabbitson
>> Sent: Monday, May 14, 2007 1:57 PM
>> To: lartc@mailman.ds9a.nl
>> Subject: [LARTC] Multihome load balancing - kernel vs netfilter
>> 
>> Hi,
>> I have searched the archives on the topic, and it seems that the list
>> gurus favor load balancing to be done in the kernel as opposed to other
>> means. I have been using a home-grown approach, which splits traffic
>> based on `-m statistic --mode random --probability X`, then CONNMARKs
>> the individual connections and the kernel happily routes them. I
>> understand that for > 2 links it will become impractical to calculate a
>> correct X. But if we only have 2 gateways to the internet - are there
>> any advantages in letting the kernel multipath scheduler do the
>> balancing (with all the downsides of route caching), as opposed to the
>> pure random approach described above?
> 
> I have thought about this approach, but, I think, this approach does not
> handle failover/dead-gateway-detection well. Because you need to alter
> all your netfilter routing rules if you find a link down. And then
> reconfigure again when the link comes up. I am interested to know how
> you handle that.
> 

Certainly. I am NATing a large company network, which is load balanced
and gets failover protection. I also run a number of services on the
router itself which must be neither balanced nor failed over, as they
are expected to respond on a specific IP only. All remaining traffic
originating on the router is not balanced, but fails over when the
designated primary link goes down.

I start with a simple pinger app that pings several well-known remote
sites once a minute using a large ICMP packet (1k of payload). The RTTs
are averaged per link and used to calculate the current "quality" of
the link (the large packet makes congestion a visible factor). If the
result for an interface is 0 (meaning not a single one of the pinged
hosts has responded), the link is considered dead.
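
As a rough sketch of that check (not the actual pinger; the function
names, the 2 second timeout and the ping/sed invocation are assumptions
for illustration), the per-link quality could be computed like this:

```shell
# Hypothetical helpers, not the actual pinger.

# Ping one host through one interface with a 1k payload and print the
# RTT in ms; prints nothing if the host does not answer within 2 s.
# $1 = interface name, $2 = host to ping.
probe_rtt() {
    ping -c 1 -s 1024 -W 2 -I "$1" "$2" 2>/dev/null |
        sed -n 's/.*time=\([0-9.]*\) ms.*/\1/p'
}

# Average the RTTs passed as arguments; print 0 if there are none,
# which is the "link is dead" case.
link_quality() {
    if [ "$#" -eq 0 ]; then
        echo 0
        return 0
    fi
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.1f\n", sum / NR }'
}
```

Hosts that do not answer simply contribute no argument, so
"link_quality 20 40" prints 30.0 and "link_quality" with no arguments
prints 0. The RTT list for a link would come from calling probe_rtt
once per pinged host on that link's interface.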

In iproute I have two separate tables, each using one of the links as
its default gw and selected by matching a certain mark. The default
route is set to a single gateway (not a multipath), either by
hardcoding or by using the first result of the pinger (it can run
without a default gw set; explanation follows).
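
For illustration, the two tables could be set up roughly as below. This
is only a sketch: reusing the marks 11 and 12 as table numbers, the
gateway addresses (documentation ranges) and the interface names eth1
and eth2 are all assumptions. The function prints the commands instead
of running them, so they can be reviewed before being piped to sh as
root.

```shell
# Print the policy-routing commands for one uplink.
# $1 = firewall mark (also used as the table number, an assumption),
# $2 = gateway address, $3 = interface name.
uplink_table_cmds() {
    mark=$1
    gw=$2
    dev=$3
    # Default route for this uplink, kept in its own table.
    echo "ip route replace default via $gw dev $dev table $mark"
    # Packets carrying the mark are looked up in that table.
    echo "ip rule add fwmark $mark table $mark"
}

uplink_table_cmds 11 192.0.2.1 eth1
uplink_table_cmds 12 198.51.100.1 eth2
```

Note that "ip rule add" is not idempotent, so a real script would
delete or flush the old rules before adding them again.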

In iptables I have two user defined chains:
    iptables -t mangle -A ISP1 -j CONNMARK --set-mark 11
    iptables -t mangle -A ISP1 -j MARK --set-mark 11
    iptables -t mangle -A ISP1 -j ACCEPT

    iptables -t mangle -A ISP2 -j CONNMARK --set-mark 12
    iptables -t mangle -A ISP2 -j MARK --set-mark 12
    iptables -t mangle -A ISP2 -j ACCEPT

The rules that reference those chains are:

For all locally originating traffic:
    iptables -t mangle -A OUTPUT -o $I1 -j ISP1
    iptables -t mangle -A OUTPUT -o $I2 -j ISP2

For all incoming traffic from the internet:
    iptables -t mangle -A PREROUTING -i $I1 -m state --state NEW -j ISP1
    iptables -t mangle -A PREROUTING -i $I2 -m state --state NEW -j ISP2

For all other traffic (nat):
    iptables -t mangle -A PREROUTING -m state --state NEW \
        -m statistic --mode random --probability $X -j ISP1
    iptables -t mangle -A PREROUTING -m state --state NEW -j ISP2

At the end of the PREROUTING chain I have:
    iptables -t mangle -A PREROUTING -j CONNMARK --restore-mark

The NATing is trivially solved by:
    iptables -t nat -A POSTROUTING -s 10.0.58.0/24 -j SOURCE_NAT
    iptables -t nat -A POSTROUTING -s 192.168.58.0/24 -j SOURCE_NAT
    iptables -t nat -A POSTROUTING -s 192.168.8.0/24 -j SOURCE_NAT

    iptables -t nat -A SOURCE_NAT -o $I1 -j SNAT --to $I1_IP
    iptables -t nat -A SOURCE_NAT -o $I2 -j SNAT --to $I2_IP

What does this achieve:
* Local applications that have explicitly requested a specific IP to
bind to will be routed over the corresponding interface and will stay
that way. Only applications binding to 0.0.0.0 will be routed by
consulting the default route.
* Responses to connections from the internet are guaranteed to leave
from the same interface they came in on.
* All new connections not coming from the external interfaces are load
balanced by the weight of $X, and are again guaranteed to stay on the
chosen link for the life of the connection, but another connection to
the same host is not guaranteed to go over the same link. This is
important in a company environment, since most employees use the same
online resources.

On every run of the pinger I do the following:
* If both gateways are alive I replace the -m statistic rule, adjusting
the value of $X
* If one is detected dead, I adjust the probability accordingly (or
alternatively remove the statistic match altogether), and change the
default gateway if it is the one that failed.
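
To make the per-run update concrete, here is one possible sketch. The
weighting formula X = rtt2 / (rtt1 + rtt2) is an assumption and not
necessarily what the pinger uses, and so are the function names and the
rule number passed to -R. A quality of 0 means the link is dead, as
described earlier. The second function prints the replacement command
rather than running it.

```shell
# Probability of sending a new connection to ISP1.
# $1 = quality (average RTT) of ISP1, $2 = quality of ISP2; 0 = dead.
# Hypothetical weighting: the faster link gets the larger share.
isp1_probability() {
    awk -v q1="$1" -v q2="$2" 'BEGIN {
        if (q1 == 0 && q2 == 0) x = 0.5   # both dead: nothing sensible to do
        else if (q1 == 0)       x = 0     # ISP1 dead: everything to ISP2
        else if (q2 == 0)       x = 1     # ISP2 dead: everything to ISP1
        else                    x = q2 / (q1 + q2)
        printf "%.2f\n", x
    }'
}

# Print the command that replaces the statistic rule in place.
# $1 = position of that rule in mangle PREROUTING (an assumption),
# $2 = new probability.
statistic_rule_cmd() {
    echo "iptables -t mangle -R PREROUTING $1 -m state --state NEW -m statistic --mode random --probability $2 -j ISP1"
}
```

For example, "isp1_probability 20 60" prints 0.75, so the faster ISP1
receives three quarters of the new connections, while
"isp1_probability 0 50" prints 0.00 and everything falls through to
ISP2.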

So really the whole exercise revolves around changing a single rule (or
two rules, if you want to control the probability in a more fine-grained
way).

Last but not least, this setup allowed me to program exception tables
for certain IP blocks. For instance Yahoo has a braindead two-tier
authentication system for commercial solutions: it remembers the IP you
first logged in from, and it must match the IP used to log in to a more
secure area (using another password). Or users from within the LAN
might want to use one of the ISPs' SMTP servers, which keeps a close
eye on who is talking to it. So I have a $PREFERRED which is adjusted
to either ISP1 or ISP2, depending on the current state of affairs, and
rules like:
    iptables -t mangle -A PREROUTING -d 66.218.64.0/19 \
        -m state --state NEW -j $PREFERRED
    iptables -t mangle -A PREROUTING -d 68.142.192.0/18 \
        -m state --state NEW -j $PREFERRED
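
How $PREFERRED gets adjusted is not spelled out above; one simple
policy, given purely as an assumption, is to keep a configured
favourite unless the pinger reports it dead (quality 0) and to fall
back to the other chain in that case:

```shell
# Pick the chain the exception rules should jump to.
# $1 = configured favourite (ISP1 or ISP2),
# $2 = quality of ISP1, $3 = quality of ISP2; 0 = dead.
# Hypothetical helper, not taken from the original setup.
choose_preferred() {
    favourite=$1
    q1=$2
    q2=$3
    if [ "$favourite" = "ISP1" ]; then
        if [ "$q1" != "0" ]; then echo ISP1; else echo ISP2; fi
    else
        if [ "$q2" != "0" ]; then echo ISP2; else echo ISP1; fi
    fi
}

PREFERRED=$(choose_preferred ISP1 25.0 30.0)
```

Since $PREFERRED is expanded when a rule is added, the exception rules
have to be replaced whenever its value changes.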

This pretty much sums it up. The only downside I can think of is that a
link failure goes unnoticed until the next run of the pinger, so loss
of service can be observed between two runs. Let me know if I missed
something, be it critical or minor.

Thanks

Peter