From: Dmitry Akindinov <dimak@stalker.com>
To: Julian Anastasov <ja@ssi.bg>
Cc: lvs-devel@vger.kernel.org
Subject: Re: Multiple load balancers problem
Date: Mon, 27 Aug 2012 19:27:19 +0400
Message-ID: <503B91D7.7070901@stalker.com>
In-Reply-To: <503B8EF6.5040609@stalker.com>
Hello,
An addition below.
On 2012-08-27 19:15, Dmitry Akindinov wrote:
> Hello,
>
> Sorry for top posting: doing this to avoid clutter below.
>
> Thank you for your assistance. Let me summarize the current
> situation, after all the changes and testing.
>
> 1. The test system consists of two servers, S1 and S2. Both are running
> CentOS 6.0:
>
> [root@fm1 ~]# uname -a
> Linux fm1.***.com 2.6.32-71.el6.x86_64 #1 SMP Fri May 20 03:51:51 BST 2011
> x86_64 x86_64 x86_64 GNU/Linux
>
> We are now setting up new boxes (CentOS 6.3) to re-test with newer kernels
>
> 2. Both systems have iptables configured to mark the traffic to VIP with
> the "100" marker.
>
> 3. At the beginning of the test,
> the IPVS configuration on S1 is:
> -A -f 100 -s rr -p 1
> -a -f 100 -r S1:0 -g -w 100
> -a -f 100 -r S2:0 -g -w 100
>
> The IPVS configuration on S2 is:
> -A -f 100 -s rr -p 1
> -a -f 100 -r S1:0 -g -w 0
> -a -f 100 -r S2:0 -g -w 100
>
> We establish test connections from client systems to port 110 of the
> VIP; S1 routes one connection to itself and the other one to S2. Both
> connections are alive and well, and the connection tables on both systems
> are the same thanks to the IPVS sync daemons:
>
> ipvsadm -l -c -n:
> IP 00:49 NONE C1:0 0.0.0.100:0 S1:0
> TCP 14:49 ESTABLISHED C1:54837 VIP:110 S1:110
> TCP 14:43 ESTABLISHED C2:54648 VIP:110 S2:110
> IP 00:43 NONE C2:0 0.0.0.100:0 S2:0
>
> Now, we initiate a failover, so S2 becomes the active balancer.
> The IPVS rules on S2 are updated, so they become the same as they were
> on S1:
> -A -f 100 -s rr -p 1
> -a -f 100 -r S1:0 -g -w 100
> -a -f 100 -r S2:0 -g -w 100
>
> And S1 gets the same config as S2 used before the failover.
>
> All the connections that existed on S2 before failover continue to work.
>
> But the connections that existed on S1 are closed as soon as the client
> sends any data to that connection. The tcpdump on S1 does not show any
> incoming packets, and tcpdump on S2 shows that it's S2 itself (the new
> load balancer) that closes these connections (the data the client has
> sent was "HELP\r\n" - 6 bytes):
>
> 07:54:59.214200 IP (tos 0x10, ttl 54, id 20406, offset 0, flags [DF],
> proto TCP (6), length 58)
> C1.54837 > VIP.110: Flags [P.], cksum 0xba0d (correct), seq
> 3572724860:3572724866, ack 1018696840, win 33304, options [nop,nop,TS
> val 3371318384 ecr 1243703099], length 6
> 07:54:59.214253 IP (tos 0x10, ttl 64, id 0, offset 0, flags [DF], proto
> TCP (6), length 40)
> VIP.110 > C1.54837: Flags [R], cksum 0x5767 (correct), seq 1018696840,
> win 0, length 0
>
> What can cause the new load balancer to reset (Flags [R]) the existing
> connections to the "old" balancer?
>
> New connections now work fine, being distributed by the new load
> balancer to itself and to the old balancer.
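A capture restricted to resets makes a trace like the one quoted above easier to reproduce. The following is only a sketch: the interface name eth0 is an assumption, VIP stands for the real virtual address as elsewhere in this message, and the command is printed rather than executed because tcpdump needs root.

```shell
#!/bin/sh
# Build a pcap filter that matches only TCP segments with the RST flag set
# that are exchanged with the virtual service.
# $1 = virtual IP (placeholder), $2 = service port.
rst_filter() {
    printf 'host %s and tcp port %s and tcp[tcpflags] & tcp-rst != 0\n' "$1" "$2"
}

# Dry run: show the tcpdump invocation instead of running it.
echo tcpdump -n -v -i eth0 "'$(rst_filter VIP 110)'"
```

Running the printed command on both S1 and S2 during a failover should show which box emits the reset.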
PS. A look at the ipvsadm -l -c -n output on the new balancer showed that
the troubled connection (directed to the "old" balancer)
appears in the ESTABLISHED state both before and after the client has sent
some data and the new load balancer decided to drop the connection.
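To pick the affected entries out of a larger connection table, something along these lines may help. It is a minimal sketch that assumes the column order of ipvsadm -l -c -n shown earlier (protocol, expire, state, source, virtual, destination); the function name conns_to_real is made up for illustration.

```shell
#!/bin/sh
# Print source, virtual and destination of every established TCP entry
# whose destination is the given real server.
# Reads `ipvsadm -l -c -n` style output on stdin; $1 = real server address.
conns_to_real() {
    awk -v real="$1" '
        $1 == "TCP" && $3 == "ESTABLISHED" && index($6, real ":") == 1 {
            print $4, $5, $6
        }'
}

# Sample input copied from the table quoted above (placeholders kept as-is).
conns_to_real S1 <<'EOF'
IP  00:49 NONE        C1:0     0.0.0.100:0 S1:0
TCP 14:49 ESTABLISHED C1:54837 VIP:110     S1:110
TCP 14:43 ESTABLISHED C2:54648 VIP:110     S2:110
IP  00:43 NONE        C2:0     0.0.0.100:0 S2:0
EOF
# prints: C1:54837 VIP:110 S1:110
```

On a live balancer the input would come from `ipvsadm -l -c -n | conns_to_real <address>` instead of the here-document.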
The connection tracking is switched off on both servers:
*raw
:PREROUTING ACCEPT [887797:396864975]
:OUTPUT ACCEPT [426902:66177111]
-A PREROUTING -d VIP/32 -j NOTRACK
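Whether the NOTRACK rule is really present on a given box can be checked from an iptables-save dump. This is a sketch under assumptions: has_notrack is a hypothetical helper, it only understands the plain `-d <addr>/32 -j NOTRACK` form shown above, and it does not handle the `-j CT --notrack` spelling that newer iptables releases may print instead.

```shell
#!/bin/sh
# Succeed if the dump on stdin has a raw-table PREROUTING rule that sends
# traffic for the given address to NOTRACK. $1 = address (without /32).
has_notrack() {
    awk -v vip="$1" '
        /^\*/ { table = substr($0, 2) }        # "*raw", "*filter", ...
        table == "raw" && $1 == "-A" && $2 == "PREROUTING" {
            dst = ""; tgt = ""
            for (i = 3; i < NF; i++) {
                if ($i == "-d") dst = $(i + 1)
                if ($i == "-j") tgt = $(i + 1)
            }
            if (dst == (vip "/32") && tgt == "NOTRACK") found = 1
        }
        END { exit(found ? 0 : 1) }'
}

# Sample input: the dump above, plus the COMMIT line iptables-save emits.
has_notrack VIP <<'EOF' && echo "conntrack bypassed for VIP"
*raw
:PREROUTING ACCEPT [887797:396864975]
:OUTPUT ACCEPT [426902:66177111]
-A PREROUTING -d VIP/32 -j NOTRACK
COMMIT
EOF
```

On a real system the input would be `iptables-save -t raw | has_notrack <address>`.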
The backup balancer used the following commands when it became the new
active balancer:
43 STARTBALANCER\n
* switching on
* ipvsadm -e -f 100 -r S1 -g -w 100
* ipvsadm --stop-daemon backup
* ipvsadm --start-daemon master --mcast-interface eth0 --syncid 0
* sysctl net.ipv4.conf.all.arp_ignore=0
* result=net.ipv4.conf.all.arp_ignore = 0
* sysctl net.ipv4.conf.eth0.arp_ignore=0
* result=net.ipv4.conf.eth0.arp_ignore = 0
* sysctl net.ipv4.conf.all.arp_announce=0
* result=net.ipv4.conf.all.arp_announce = 0
* sysctl net.ipv4.conf.eth0.arp_announce=0
* result=net.ipv4.conf.eth0.arp_announce = 0
* arping -c 1 -I eth0 -U VIP
* result=ARPING VIP from VIP eth0 Sent 1 probes (1 broadcast(s))
Received 0 response(s)
As you can see, the rule for the virtual server (-A -f 100) and the rule
for the local real server were not touched, and the record for the other
server (the old load balancer) was edited, not removed and re-added:
ipvsadm -e -f 100 -r S1 -g -w 100
to make it "active" (it was -w 0 while this server was not the active
balancer).
Still, when this server becomes the active balancer, it resets all
existing connections to the old balancer.
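For reference, the takeover sequence logged above can be condensed into a dry-run helper. This is a sketch only: promote_cmds is a hypothetical name, the arguments are placeholders, and the function prints the commands instead of running them, since ipvsadm, sysctl and arping need root and a real VIP.

```shell
#!/bin/sh
# Print the commands a backup issues to become the active balancer.
# $1 = fwmark, $2 = peer real server (the old balancer),
# $3 = interface, $4 = virtual IP.
promote_cmds() {
    fwmark=$1
    peer=$2
    iface=$3
    vip=$4
    # Edit the existing destination in place (weight 0 -> 100) rather than
    # deleting and re-adding it.
    echo "ipvsadm -e -f $fwmark -r $peer -g -w 100"
    # Switch the sync daemon from backup to master.
    echo "ipvsadm --stop-daemon backup"
    echo "ipvsadm --start-daemon master --mcast-interface $iface --syncid 0"
    # Let this box answer and announce ARP for the VIP again.
    for param in arp_ignore arp_announce; do
        for scope in all "$iface"; do
            echo "sysctl net.ipv4.conf.$scope.$param=0"
        done
    done
    # Unsolicited ARP so that neighbours learn the new owner of the VIP.
    echo "arping -c 1 -I $iface -U $vip"
}

promote_cmds 100 S1 eth0 VIP
```

The output can be compared line by line with the log above; the virtual service and the local real server entries are deliberately left alone.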
> On 2012-08-27 15:17, Julian Anastasov wrote:
>>
>> Hello,
>>
>> On Mon, 27 Aug 2012, Dmitry Akindinov wrote:
>>
>>>> OK, I don't know what kernel and patches every distribution
>>>> includes. Can you at least tell me what uname -a shows?
>>>
>>> Ah, sorry. That was
>>>
>>> [root@fm1 ~]# uname -a
>>> Linux fm1.***.com 2.6.32-71.el6.x86_64 #1 SMP Fri May 20 03:51:51 BST 2011
>>> x86_64 x86_64 x86_64 GNU/Linux
>>
>> I downloaded kernel-2.6.32-71.el6.src.rpm and I see
>> that it does not contain the needed changes to support
>> backup to be real server for DR/TUN:
>>
>> commit fc604767613b6d2036cdc35b660bc39451040a47
>> Author: Julian Anastasov<ja@ssi.bg>
>> Date: Sun Oct 17 16:38:15 2010 +0300
>>
>> ipvs: changes for local real server
>>
>> and to support fwmark for SYNC:
>>
>> commit fe5e7a1efb664df0280f10377813d7099fb7eb0f
>> Author: Hans Schillstrom<hans.schillstrom@ericsson.com>
>> Date: Fri Nov 19 14:25:12 2010 +0100
>>
>> IPVS: Backup, Adding Version 1 receive capability
>>
>> Functionality improvements
>> * flags changed from 16 to 32 bits
>> * fwmark added (32 bits)
>> * timeout in sec. added (32 bits)
>> * pe data added (Variable length)
>> * IPv6 capabilities (3x16 bytes for addr.)
>> * Version and type in every conn msg.
>>
>>> Yes, exactly. And to avoid this "secondary load balancing", we
>>> do not load the rules into ipvs until it becomes the active balancer.
>>>
>>> Looks like it's causing problems, so the alternative we are using now
>>> is to load the rules, but make them balance everything to a single
>>> server - the local one.
>>
>> It seems even this is not enough, because when
>> the backup receives the sync message it creates a SYNC
>> connection (after passing the initial SYN and ACK), but
>> this connection claims this backup is a real server
>> and is using the DR method. Without commit
>> fc604767613b6d2036cdc35b660bc39451040a47,
>> when the next packets come ip_vs_dr_xmit tries to send them
>> to LOCAL_OUT (DR forwarding) instead of returning
>> NF_ACCEPT as for LOCALNODE. As a result, the packet does not
>> reach the local stack as the previous SYN and ACK packets did,
>> and maybe what you see is the packet looping in the stack, causing
>> 100% CPU usage, since you said below that it disappears:
>>
>>> Now, we see the client trying to send some data to the server,
>>> and we see the data packet hitting the active load balancer,
>>> and then - the inactive load balancer. And there we see the
>>> packet disappearing - the application does not see it, and since
>>> there is no "ack" sent back to the client, we see the client
>>> TCP stack resending that packet over and over, but all resent
>>> packets have the same fate - they disappear inside the inactive
>>> load balancer.
>>>
>>> We can send the actual tcpdumps if needed.
>>
>> Not needed; I think you need a kernel update.
>>
>>>> directs the SYN there. It can happen only for DR/TUN because
>>>> the daddr is VIP, that is why people overcome the problem
>>>> by checking that packet comes from some master and not
>>>> from uplink gateway MAC. For NAT there is no such double-step
>>>> scheduling because the backups' rules do not match the
>>>> internal real server IP in the daddr, they work only for VIP
>>>
>>> No, this is not the case. The backup balancer did not have rules,
>>
>> Yes, I just explained this variant too.
>>
>>>> Interesting, the new master forwards to the old master,
>>>> so it should send SYNC containing the old master as real
>>>> server. How can there be a problem? Maybe your kernel does
>>>> not properly support the local server function, which was
>>>> fixed 2 years ago.
>>>
>>> Hmm. I assume the kernel we use is pretty fresh.
>>
>> I see ip_vs_conn.c from Sep 1 2010 is the latest
>> file from IPVS.
>>
>>>> Maybe the SYNC message changes the destination in
>>>> backup, as I already said above? Some tcpdump output will
>>>> be helpful in case you don't know how to dig into the
>>>> sources of your kernel.
>>>
>>> There is no change in destination. The dropped packets are really
>>> dropped, not relayed somewhere. Also, if they were relayed, they could
>>> only be relayed to the active balancer, as ipvs config only has or had
>>> these two servers in it.
>>> And tcpdump on the active balancer properly shows the packets sent to
>>> the backup balancer, but no packets coming back from that balancer.
>>
>> Yes, maybe they loop in the stack: DR via LOCAL_OUT,
>> then they appear again in LOCAL_IN for forwarding?
>>
>>>> Very good, only that you need recent kernel for this,
>>>> 2010-Nov +, there are fixes even after that time.
>>>
>>> Yes, it looks like we have the kernels built in May-2011.
>>
>> Yep.
>>
>>>> table and you can switch between them at any time. Of
>>>> course, there is some performance price for traffic that
>>>> goes to the local stack of backups but they should get from
>>>> current master only traffic for their stack.
>>>
>>> That's not what concerns us. IPVS on the backup balancer is now
>>> being filled by 2 sources: the "sync" process, which copies records
>>> from the active balancer, and the IPVS itself.
>>>
>>> I.e. now (when we have rules in the backup balancer, too) -
>>> when a new connection arrives to the backup balancer,
>>> the balancer creates a connection record and places it into its
>>> connection table.
>>> A few moments later, the sync daemon receives a connection
>>> record for the same connection from the active load balancer,
>>> and it also wants to put that record into the connection table
>>> on the backup balancer.
>>> Our concern is a potential conflict here: that record is already
>>> in the table. If you say that there can be no conflict - it would
>>> be nice, but we do not know how ipvs is designed, so we
>>> cannot get rid of that concern on our own.
>>
>> Maybe we should stop any forwarding while we are
>> in backup mode. The problem is that we can be in both
>> master and backup mode at once, and I'm not sure if this is used
>> at all. I guess master and backup use different syncids, but
>> anyway, maybe such a setup works only for NAT.
>>
>>> When a backup balancer is instructed to become an active one,
>>> our application automatically loads the ruleset with all other
>>> real servers into its ipvs rule set, and then sends arp broadcast
>>> for all VIPs, switching the traffic to the new active balancer.
>>>
>>> The existing connections should survive, as the connection table
>>> contains all the records sync'ed from the old active balancer, right?
>>
>> Yes.
>>
>>> The interesting question is how ipvs assigns the connection records
>>> received via the sync protocol: as we have seen, we had to put the
>>> virt server and the local real server rules into ipvs in order to stop
>>> the problem of the "backup" mode.
>>> Now, during the failover, we add the rules for other "real servers"
>>> AFTER the connection records for their connections were received
>>> from the then-active balancer. Will it cause the same type of problem?
>>
>> Not fatal but without rules we can not maintain
>> actual counters for active/inactive conns. After failover
>> the setup will start with zeroed counters that are
>> later modified only for new connections, all SYNCed conns are
>> not accounted and the first minutes after failover we can
>> see some imbalance.
>>
>> Regards
>>
>> --
>> Julian Anastasov<ja@ssi.bg>
>
--
Best regards,
Dmitry Akindinov -- Stalker Labs.