[RFC] ct_sync 0.15 (corrected)

* [RFC] ct_sync 0.15 (corrected)
@ 2004-08-13 14:26 KOVACS Krisztian
  2004-08-19 11:06 ` Harald Welte
                   ` (3 more replies)
  0 siblings, 4 replies; 55+ messages in thread
From: KOVACS Krisztian @ 2004-08-13 14:26 UTC (permalink / raw)
  To: Netfilter-failover list; +Cc: netfilter-devel, jamal


  Hi,

  [warning: this mail got quite long]

  I think I have finally managed to get ct_sync into a state where some
more public review/testing would certainly do the project good. This week
I've been successfully running overnight stress tests on our two-node
cluster, and both of the nodes survived all of the tests without locking
up, panicking, etc. OK, I know this is not too much, but it is certainly
better than anything we had up to now, and somehow I felt this would be
the bare minimum I should reach before asking anyone to try out ct_sync.
Of course, there is still a lot to do. Just to mention two things which
certainly need a lot of work: protocol implementation is really the bare
minimum which is capable of doing anything, and expectation support is
missing completely. The current implementation supports replicating
conntrack entries, and supports NAT. So basically all the simple things
(not using expectations) should work reasonably well. This is why I've
bumped the version number to 0.15; it is available in the Netfilter CVS.

  The aim of this e-mail is twofold:

      * First, I just thought that ct_sync is now ready for some
        testing. Unfortunately our development and testing environment
        is very limited, so anyone being able to help us out and do some
        testing would be a big help.

      * After doing some work to stabilize the current code, there is
        now a need to discuss a few things before going on with
        implementation. So, I'd be happy if there was some discussion on
        these things before starting to implement anything.

  I'll try to summarize the most important open problems I've come
across. This list does not contain things I consider trivial, for
example exporting important internal constants as tunable parameters
through sysctl.

     1. There should be some facility by which one can select which
        connections have to be replicated. This way it would be possible
        to limit replication traffic to the bare minimum. For example,
        there is no point in replicating conntrack entries for
        connections whose endpoint is one of the nodes (administrative
        SSH traffic, for example). A per-conntrack flag would be needed,
        just like CONNMARK, which could be set for conntracks needing
        replication with a simple iptables rule. Actually, CONNMARK is
        enough, if we choose a given bit of the mark as the SYNC bit.
        Besides this, we should decide whether we need a SYNC or a NOSYNC
        bit, that is, whether the default mode of operation should be
        "sync" or "do not sync".

     2. The error recovery functions in the protocol layer should be
        revamped. The protocol is a plain sequence numbered, NACK-based
        one using multicast UDP. Lost packets are detected by the
        receiver when the next packet arrives with an unexpected
        sequence number (not 'seq of the last packet + 1').
        When this occurs, the node sends a recovery request containing
        the last successfully received sequence number. When the current
        master receives such a request, it should re-send all matching
        packets from the backlog of its send ring. However, to iterate
        over the entries of the ring, it should hold the spinlock of the
        ring, which is not possible, since the send() operation may
        sleep... (This is done from the receiver thread, and the ring is
        accessed from the sender thread and from softirq context as well.)
        What would be the most elegant solution?

        On the other hand, it may be possible that the master is not
        able to re-send the packet, for example this may be the case if
        it is "too old", and is not present in the backlog anymore. In
        this case, the slave should be notified that recovery is not
        possible this way, and it needs to do a full re-sync. This is
        why I thought that we should include some extra information in
        every packet: the minimal sequence number of the oldest packet
        in the master's backlog. Using this approach upon receiving a
        packet with a 'wrong' sequence number the slave can immediately
        decide if there is still hope of recovering the missing packets
        or not. If not, it requests a full resynchronization instead. I
        think this feature could handle the problem of a broken link
        between the nodes causing lots of lost packets. Am I missing
        something?

        The protocol layer is really very dumb in other respects as well.
        For example it simply drops all packets considered too fresh
        instead of queuing them. (Although I would like to add that
        typically there should be no errors on the replication network
        at all, except for administrative reasons.) However, I
        really don't think we should develop a full-blown protocol,
        there is simply no point in creating one more reliable protocol...
        So, does anyone know of anything which could be used by ct_sync?
        (It has to be a semi-reliable, connectionless multicast protocol
        with a _very_ low overhead.)

     3. There are a few things in the connection tracking code which
        are incompatible with replication "by design". For example,
        the expectfn() function in the expectation structure is such:
        simply, there is no way to replicate a stand-alone function
        pointer which could point to any arbitrary function. One more
        example could be TCP window tracking, I don't think we have
        the necessary bandwidth and CPU time to send an update message
        after each and every received TCP packet... Any idea how we
        could solve these problems?

     4. The current version is 2.4-only, it is for the good old
        ip_conntrack, and supports IPv4 only. I don't really think
        this is the way to go, but there is commercial interest in
        having this kind of failover functionality as fast as possible.
        However, I think that after reaching some state which is
        acceptable for the users needing the basic features fast, this
        whole thing should be re-designed and ported to 2.6 and
        nf_conntrack. This would depend on a few other things, such as
        porting ctnetlink for nf_conntrack, but I think those would
        be important to have as well. Again, this would be quite a
        lot of work to do, thus deferring the 'stable' (production ready)
        release of the code.
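
  To make the recovery decision from item 2 concrete, here is a minimal
userspace sketch of what a receiver could do once every packet carries the
sequence number of the oldest packet still in the master's backlog. All
names (classify_packet, the rx_action values, oldest_backlog_seq) are
hypothetical and not taken from the ct_sync sources, and the sketch assumes
32-bit sequence numbers that may wrap around.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical receiver verdicts; not taken from the ct_sync sources. */
enum rx_action {
    RX_ACCEPT,          /* in-order packet, process it */
    RX_DROP_OLD,        /* duplicate or already-seen packet */
    RX_REQUEST_RESEND,  /* gap, but the master can still re-send it */
    RX_FULL_RESYNC      /* gap reaches back beyond the master's backlog */
};

/* True if sequence number a comes before b, tolerating 32-bit wraparound. */
static int seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

static enum rx_action classify_packet(uint32_t last_seq, uint32_t pkt_seq,
                                      uint32_t oldest_backlog_seq)
{
    uint32_t expected = last_seq + 1;

    /* Assumed invariant: the backlog never starts after the packet
     * that advertises it. */
    assert(!seq_before(pkt_seq, oldest_backlog_seq));

    if (pkt_seq == expected)
        return RX_ACCEPT;
    if (seq_before(pkt_seq, expected))
        return RX_DROP_OLD;
    /* Gap: the first missing packet is 'expected'. */
    if (seq_before(expected, oldest_backlog_seq))
        return RX_FULL_RESYNC;
    return RX_REQUEST_RESEND;
}
```

  With last_seq = 10, a packet numbered 13 leads to a resend request while
the master still holds packet 11, and to a full resync once the advertised
backlog starts at 12 or later.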


  Wow, this got quite long, thanks for reading all this. :)
And happy testing and reporting back! :)

-- 
 Regards,
   Krisztian KOVACS

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC] ct_sync 0.15 (corrected)
  2004-08-13 14:26 [RFC] ct_sync 0.15 (corrected) KOVACS Krisztian
@ 2004-08-19 11:06 ` Harald Welte
  2004-08-19 12:13   ` KOVACS Krisztian
                     ` (3 more replies)
  2004-08-22  0:40 ` Patrick McHardy
                   ` (2 subsequent siblings)
  3 siblings, 4 replies; 55+ messages in thread
From: Harald Welte @ 2004-08-19 11:06 UTC (permalink / raw)
  To: KOVACS Krisztian; +Cc: Netfilter-failover list, netfilter-devel, jamal

On Fri, Aug 13, 2004 at 04:26:30PM +0200, KOVACS Krisztian wrote:

>      1. There should some facility by which one can select which
>         connections have to be replicated. This way it would be possible
>         to limit replication traffic to the bare minimum. For example,
>         there is no point in replicating conntrack entries for
>         connections whose endpoint is one of the nodes (administrative
>         SSH traffic, for example). A per-conntrack flag would be needed,
>         just like CONNMARK, which could be set for conntracks needing
>         replication with a simple iptables rule. Actually, CONNMARK is
>         enough, if we choose a given bit of the mark as the SYNC bit.
>         Besides this, we should decide if we needed a SYNC or a NOSYNC
>         bit, that is, if the default mode of operation should be "sync
>         or not to sync".

I would just use connmark for now.  Let's make it a CONFIG option
though, so people can just use connmark without any interference and
replicate all connections.

>      2. The error recovery functions in the protocol layer should be
>         revamped. 
> 	  However, to iterate over the entries of the ring, it should
> 	  hold the spinlock of the ring, which is not possible, since
> 	  the send() operation may sleep... (This is done from the
> 	  receiver thread, and the ring is accessed from the sender
> 	  thread and from softirq context as well.) What would be the
> 	  most elegant solution?

Given that this is an event expected to happen very rarely, I would
propose to just:
- grab the lock
- copy the whole ring (or the needed parts)
- release the lock
- send packets from the local copy (may sleep)
- free local copy
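
The steps above can be sketched in userspace C as follows. This is only an
illustration under stated assumptions: the ring layout, RING_SIZE, and the
names send_ring, ring_resend_after and count_send are hypothetical, and a
C11 atomic_flag stands in for the kernel spinlock. The buffer is allocated
before the lock is taken so that nothing can block while it is held.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define RING_SIZE 8  /* hypothetical backlog size */

struct ring_pkt {
    uint32_t seq;
    char payload[32];
};

struct send_ring {
    atomic_flag lock;                 /* userspace stand-in for the ring spinlock */
    struct ring_pkt slots[RING_SIZE]; /* sequence s lives in slots[s % RING_SIZE] */
    uint32_t oldest_seq;              /* sequence number of the oldest packet kept */
    uint32_t count;                   /* number of valid packets, at most RING_SIZE */
};

typedef int (*send_fn)(const struct ring_pkt *pkt, void *ctx);

static void ring_lock(struct send_ring *r)
{
    while (atomic_flag_test_and_set_explicit(&r->lock, memory_order_acquire))
        ;  /* spin */
}

static void ring_unlock(struct send_ring *r)
{
    atomic_flag_clear_explicit(&r->lock, memory_order_release);
}

/* Re-send every packet newer than last_ok.
 * Returns the number of packets sent, or -1 if out of memory. */
static int ring_resend_after(struct send_ring *r, uint32_t last_ok,
                             send_fn send, void *ctx)
{
    struct ring_pkt *copy = malloc(RING_SIZE * sizeof(*copy));
    uint32_t i;
    int n = 0, sent = 0;

    if (!copy)
        return -1;

    ring_lock(r);
    assert(r->count <= RING_SIZE);
    for (i = 0; i < r->count; i++) {
        uint32_t seq = r->oldest_seq + i;
        if ((int32_t)(seq - last_ok) > 0)   /* wraparound-safe "newer than" */
            copy[n++] = r->slots[seq % RING_SIZE];
    }
    ring_unlock(r);

    /* The lock is released, so send() is free to sleep here. */
    for (i = 0; i < (uint32_t)n; i++)
        if (send(&copy[i], ctx) == 0)
            sent++;

    free(copy);
    return sent;
}

/* Stub sender for the sketch: counts calls instead of touching a socket. */
static int count_send(const struct ring_pkt *pkt, void *ctx)
{
    (void)pkt;
    ++*(int *)ctx;
    return 0;
}
```

The cost is one temporary copy of the backlog per recovery request, which
seems acceptable if recovery really is rare.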

>         On the other hand, it may be possible that the master is not
>         able to re-send the packet, for example this may be the case if
>         it is "too old", and is not present in the backlog anymore. In
>         this case, the slave should be notified that recovery is not
>         possible this way, and it needs to do a full re-sync.

Within the current protocol, the master can just make that decision and
do a full resync without telling the slave.

>	  This is why I thought that we should include some extra
>	  information in every packet: the minimal sequence number of
>	  the oldest packet in the master's backlog. 

Agreed.  We should also add a read-only sysctl that tells userspace
whether a slave is already fully-synced. 

>         So, do anyone know of anything which could be used by ct_sync?
>         (It has to be a semi-reliable, connectionless multicast protocol
>         with a _very_ low overhead.)

everything I've seen so far about reliable multicast is inherently
complex.

>      3. There are a few things in the connection tracking code which
>         are incompatible with replication "by design". For example,
>         the expectfn() function in the expectation structure is such:
>         simply, there is no way to replicate a stand-alone function
>         pointer which could point to any arbitrary function. 

Yes, indeed.  We could look up the symbol name in the symbol table and
replicate that ;)   Crude hack, but it would work.

> 	 One more example could be TCP window tracking, I don't think we
> 	 have the necessary bandwidth and CPU time to send an update
> 	 message after each and every received TCP packet... Any idea
> 	 how we could solve these problems?

We already do this since the timeout is updated with every packet.  So
at this point, I see not much difference.  Jozsef and I agreed some time
in the past, that if we don't replicate all the window information, in
the event of a slave being promoted to master, the new master should
disable windowtracking or switch into a lazy mode.

>      4. The current version is 2.4-only, it is for the good old
>         ip_conntrack, and supports IPv4 only. I don't really think
>         this is the way to go, but there is commercial interest in
>         having this kind of failover functionality as fast as possible.

Ack.

>         However, I think that after reaching some state which is
>         acceptable for the users needing the basic features fast, this
>         whole thing should be re-designed and ported to 2.6 and
>         nf_conntrack. This would depend on a few other things, such as
>         porting ctnetlink for nf_conntrack, but I thing those would
>         be important to have as well. Again, this would be quite a
>         lot work to do, thus deferring the 'stable' (production ready)
>         release of the code.

I would first make the 2.4.x version stable and almost feature-complete
(as far as possible).  We have then learned our lessons and can clean it
up while porting on top of nf_conntrack.

>  Regards,
>    Krisztian KOVACS

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie

* Re: [RFC] ct_sync 0.15 (corrected)
  2004-08-19 11:06 ` Harald Welte
@ 2004-08-19 12:13   ` KOVACS Krisztian
  2004-08-26 10:00     ` Jozsef Kadlecsik
  2004-08-19 12:13   ` KOVACS Krisztian
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 55+ messages in thread
From: KOVACS Krisztian @ 2004-08-19 12:13 UTC (permalink / raw)
  To: Harald Welte; +Cc: Netfilter-failover list, netfilter-devel

  Hi,

On Thu, 2004-08-19 at 13:06, Harald Welte wrote:
> >         On the other hand, it may be possible that the master is not
> >         able to re-send the packet, for example this may be the case if
> >         it is "too old", and is not present in the backlog anymore. In
> >         this case, the slave should be notified that recovery is not
> >         possible this way, and it needs to do a full re-sync.
> 
> Within the current protocol, the master can just make that decision and
> do a full resync without telling the slave.

  Indeed. However, I'd like to implement a new state (SLAVE_BROKEN)
later, which could be used to avoid electing a slave as new master if
that node is known to be not in sync. A facility of this kind would help
in early detection of such cases.

> >      3. There are a few things in the connection tracking code which
> >         are incompatible with replication "by design". For example,
> >         the expectfn() function in the expectation structure is such:
> >         simply, there is no way to replicate a stand-alone function
> >         pointer which could point to any arbitrary function. 
> 
> Yes, indeed.  we could look up the symbol name in the symbol table and
> replicate that ;)   Crude hack, but it would work.

  Unfortunately I don't think it would work... AFAIK there is no symbol
table information in 2.4 kernels and usually these functions are
declared as static anyway. Moreover, take a look at the H.323 helper, it
uses this expectfn function to set the helper of the conntrack to an
unregistered helper structure... I think there is no point in making
ct_sync overly complicated; these helpers should be fixed instead. (I
don't know why the H.323 helper does things this way, but it is completely
hopeless to replicate things like this.)

> > 	 One more example could be TCP window tracking, I don't think we
> > 	 have the necessary bandwidth and CPU time to send an update
> > 	 message after each and every received TCP packet... Any idea
> > 	 how we could solve these problems?
> 
> We already do this since the timeout is updated with every packet.  So
> at this point, I see not much difference.  Jozsef and I agreed some time
> in the past, that if we don't replicate all the window information, in
> the event of a slave being propagated to master, the new master should
> disable windowtracking or switch into a lazy mode.

  Indeed, the timeout is updated with every packet. However, we do not
generate an update message for IPCT_REFRESH events at the moment. On the
other hand, we replicate the timeout changes when a state change occurs
as well, so I don't worry about incorrect (relative) timeout values at
all.

  Anyway, thanks for the comments.

-- 
 Regards,
   Krisztian KOVACS

* Re: [RFC] ct_sync 0.15 (corrected)
  2004-08-19 11:06 ` Harald Welte
  2004-08-19 12:13   ` KOVACS Krisztian
  2004-08-19 12:13   ` KOVACS Krisztian
@ 2004-08-19 16:13   ` Henrik Nordstrom
  2004-08-22 20:43   ` KOVACS Krisztian
  3 siblings, 0 replies; 55+ messages in thread
From: Henrik Nordstrom @ 2004-08-19 16:13 UTC (permalink / raw)
  To: Harald Welte
  Cc: KOVACS Krisztian, Netfilter-failover list, netfilter-devel, jamal

On Thu, 19 Aug 2004, Harald Welte wrote:

> I would just use connmark for now.  Let's make it a CONFIG option
> though, so people can just use connmark without any interference and
> replicate all connections.

With the (conn)mark operations patch discussed recently on netfilter-devel 
it makes sense to use a bitmask for this, masking which bit(s) indicate a 
session should be replicated.
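
A minimal sketch of such a bitmask test, assuming a configurable mask and
an optional inverted (NOSYNC) sense as discussed in item 1 of the original
mail. The structure and function names are hypothetical; nothing like this
exists in ct_sync 0.15. A mask of 0 is taken to mean "replicate
everything", so connmark stays free for other uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy knobs; neither exists in ct_sync 0.15. */
struct sync_policy {
    uint32_t mask;  /* connmark bits that select replication; 0 = replicate all */
    int nosync;     /* non-zero: the masked bits mean "do NOT replicate" */
};

/* Returns 1 if a conntrack with this connmark should be replicated. */
static int ct_needs_sync(uint32_t connmark, const struct sync_policy *p)
{
    int marked;

    assert(p != NULL);
    if (p->mask == 0)
        return 1;
    marked = (connmark & p->mask) != 0;
    return p->nosync ? !marked : marked;
}
```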

Regards
Henrik

* Re: [RFC] ct_sync 0.15 (corrected)
  2004-08-13 14:26 [RFC] ct_sync 0.15 (corrected) KOVACS Krisztian
  2004-08-19 11:06 ` Harald Welte
@ 2004-08-22  0:40 ` Patrick McHardy
  2004-08-22  7:49   ` [nf-failover] " KOVACS Krisztian
  2004-09-02  5:10 ` Willy Tarreau
  2004-09-24  2:42 ` jamal
  3 siblings, 1 reply; 55+ messages in thread
From: Patrick McHardy @ 2004-08-22  0:40 UTC (permalink / raw)
  To: KOVACS Krisztian; +Cc: Netfilter-failover list, netfilter-devel, jamal

Hi Krisztian,

KOVACS Krisztian wrote:

>     4. The current version is 2.4-only, it is for the good old
>        ip_conntrack, and supports IPv4 only. I don't really think
>        this is the way to go, but there is commercial interest in
>        having this kind of failover functionality as fast as possible.
>        However, I think that after reaching some state which is
>        acceptable for the users needing the basic features fast, this
>        whole thing should be re-designed and ported to 2.6 and
>        nf_conntrack. This would depend on a few other things, such as
>        porting ctnetlink for nf_conntrack, but I thing those would
>        be important to have as well. Again, this would be quite a
>        lot work to do, thus deferring the 'stable' (production ready)
>        release of the code.
>

Are there any differences between the nfnetlink-ctnetlink patch and the
ctnetlink patch in the netfilter-ha repository ? Porting ctnetlink to
2.6 would be a start. Maybe someone wants to do it, otherwise I'll do
it on a rainy day ..

Regards
Patrick

* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-08-22  0:40 ` Patrick McHardy
@ 2004-08-22  7:49   ` KOVACS Krisztian
  2004-08-22 20:42     ` Sven Schuster
  0 siblings, 1 reply; 55+ messages in thread
From: KOVACS Krisztian @ 2004-08-22  7:49 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: KOVACS Krisztian, Netfilter-failover list, netfilter-devel


   Hi,

Patrick McHardy wrote:
> Are there any differences between the nfnetlink-ctnetlink patch and the
> ctnetlink patch in the netfilter-ha repository ? Porting ctnetlink to
> 2.6 would be a start. Maybe someone wants to do it, otherwise I'll do
> it on a rainy day ..

   No, there is no difference apart from a small change I've already 
mentioned on netfilter-devel wrt the NATINFO notification. For details, see

http://lists.netfilter.org/pipermail/netfilter-devel/2004-August/016225.html

   AFAIK the ctnetlink patch has already been ported to 2.6, and a patch 
was sent to the mailing list. Unfortunately I was unable to find that 
mail in the mailing list archives...

-- 
   KOVÁCS, Krisztián

* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-08-22  7:49   ` [nf-failover] " KOVACS Krisztian
@ 2004-08-22 20:42     ` Sven Schuster
  2004-08-23  9:51       ` Patrick McHardy
  0 siblings, 1 reply; 55+ messages in thread
From: Sven Schuster @ 2004-08-22 20:42 UTC (permalink / raw)
  To: KOVACS Krisztian
  Cc: Patrick McHardy, Netfilter-failover list, netfilter-devel


Hi Krisztian, hi Patrick,

On Sun, Aug 22, 2004 at 09:49:26AM +0200, KOVACS Krisztian told us:
>   AFAIK the ctnetlink patch has already been ported to 2.6, and a patch 
> was sent to the mailing list. Unfortunately I was unable to find that 
> mail in the mailing list archives...

look at this one :)

http://marc.theaimsgroup.com/?l=netfilter-devel&m=109154590603639&w=2

Still didn't find enough time to port it to a current 2.6 kernel. But
HTH!


Sven

-- 
Linux zion 2.6.8-rc2 #1 Sun Jul 18 15:00:48 CEST 2004 i686 athlon i386 GNU/Linux
 22:38:58  up 35 days, 7 min,  1 user,  load average: 0.00, 0.02, 0.00

* Re: [RFC] ct_sync 0.15 (corrected)
  2004-08-19 11:06 ` Harald Welte
                     ` (2 preceding siblings ...)
  2004-08-19 16:13   ` Henrik Nordstrom
@ 2004-08-22 20:43   ` KOVACS Krisztian
  2004-08-24 18:37     ` Harald Welte
  3 siblings, 1 reply; 55+ messages in thread
From: KOVACS Krisztian @ 2004-08-22 20:43 UTC (permalink / raw)
  To: Harald Welte, KOVACS Krisztian, Netfilter-failover list,
	netfilter-devel, jamal


  Hi,

On Thu, Aug 19, 2004 at 01:06:46PM +0200, Harald Welte wrote:
> >         So, do anyone know of anything which could be used by ct_sync?
> >         (It has to be a semi-reliable, connectionless multicast protocol
> >         with a _very_ low overhead.)
> 
> everything I've seen so far about reliable multicast is inherently
> complex.

  Oops, I've just found TIPC. Does anyone know enough details of TIPC to
judge if its reliable multicast service would be useful for us? I've just
downloaded the IETF draft, and it seems to me that the reliable multicast
service provided by TIPC may be useful (section 2.9 of the draft). Any
ideas?

  The SourceForge URL:

  http://sourceforge.net/projects/tipc/

-- 
 KOVACS Krisztian

* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-08-22 20:42     ` Sven Schuster
@ 2004-08-23  9:51       ` Patrick McHardy
  0 siblings, 0 replies; 55+ messages in thread
From: Patrick McHardy @ 2004-08-23  9:51 UTC (permalink / raw)
  To: Sven Schuster; +Cc: KOVACS Krisztian, Netfilter-failover list, netfilter-devel

Sven Schuster wrote:

>Hi Krisztian, hi Patrick,
>
>On Sun, Aug 22, 2004 at 09:49:26AM +0200, KOVACS Krisztian told us:
>  
>
>>  AFAIK the ctnetlink patch has already been ported to 2.6, and a patch 
>>was sent to the mailing list. Unfortunately I was unable to find that 
>>mail in the mailing list archives...
>>    
>>
>
>look at this one :)
>
>http://marc.theaimsgroup.com/?l=netfilter-devel&m=109154590603639&w=2
>
>Still didn't find enough time to port it to a current 2.6 kernel. But
>HTH!
>
Thanks for the info. I'm going to do the remaining work soon.

Regards
Patrick

* Re: [RFC] ct_sync 0.15 (corrected)
  2004-08-22 20:43   ` KOVACS Krisztian
@ 2004-08-24 18:37     ` Harald Welte
  2004-08-25 11:41       ` jamal
  0 siblings, 1 reply; 55+ messages in thread
From: Harald Welte @ 2004-08-24 18:37 UTC (permalink / raw)
  To: KOVACS Krisztian, Netfilter-failover list, netfilter-devel, jamal

On Sun, Aug 22, 2004 at 10:43:26PM +0200, KOVACS Krisztian wrote:
> 
>   Hi,
> 
> On Thu, Aug 19, 2004 at 01:06:46PM +0200, Harald Welte wrote:
> > >         So, do anyone know of anything which could be used by ct_sync?
> > >         (It has to be a semi-reliable, connectionless multicast protocol
> > >         with a _very_ low overhead.)
> > 
> > everything I've seen so far about reliable multicast is inherently
> > complex.
> 
>   Oops, I've just found TIPC. Does anyone know enough details of TIPC to
> judge if its reliable multicast service would be useful for us? I've just
> downloaded the IETF draft, and it seems to me that the reliable multicast
> service provided by TIPC may be useful (section 2.9 of the draft). Any
> ideas?

Unfortunately I only learned about TIPC recently.  But looking at the
IETF draft and the current implementation, I think it is probably too
expensive.  Another argument is to not base ct_sync on something big
outside of the official kernel tree that is not under our control..

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie

* Re: [RFC] ct_sync 0.15 (corrected)
  2004-08-24 18:37     ` Harald Welte
@ 2004-08-25 11:41       ` jamal
  0 siblings, 0 replies; 55+ messages in thread
From: jamal @ 2004-08-25 11:41 UTC (permalink / raw)
  To: Harald Welte; +Cc: netfilter-devel, Netfilter-failover list, KOVACS Krisztian


As Harald says, it's a little heavyweight for what you guys need to do.
And it doesn't run on IP proper.
We are going to explore and probably use TIPC for ForCES as one of the
options for transport in the case of non-IP.
For IP we are thinking of a dual transport approach. Two sockets, one
UDP/multicast while the other is TCP|SCTP/unicast reliable.
Send always on UDP multicast; respond always on multicast. Reap the
benefits of multicast. Do NOT retransmit on multicast; retransmits of
queries, updates, etc. are done over unicast.
This simple technique is borrowed from OSPF.

If you are interested we can discuss this more.

cheers,
jamal

On Tue, 2004-08-24 at 14:37, Harald Welte wrote:
> On Sun, Aug 22, 2004 at 10:43:26PM +0200, KOVACS Krisztian wrote:
> > 
> >   Hi,
> > 
> > On Thu, Aug 19, 2004 at 01:06:46PM +0200, Harald Welte wrote:
> > > >         So, do anyone know of anything which could be used by ct_sync?
> > > >         (It has to be a semi-reliable, connectionless multicast protocol
> > > >         with a _very_ low overhead.)
> > > 
> > > everything I've seen so far about reliable multicast is inherently
> > > complex.
> > 
> >   Oops, I've just found TIPC. Does anyone know enough details of TIPC to
> > judge if its reliable multicast service would be useful for us? I've just
> > downloaded the IETF draft, and it seems to me that the reliable multicast
> > service provided by TIPC may be useful (section 2.9 of the draft). Any
> > ideas?
> 
> Unfortunately I did only learn about TIPC recently.  But looking at the
> IETF draft and the current implementation, I think it is probably too
> expensive.  Another argument is to not base ct_sync on something big
> outside of the official kernel tree that is not under our control..

* Re: [RFC] ct_sync 0.15 (corrected)
  2004-08-19 12:13   ` KOVACS Krisztian
@ 2004-08-26 10:00     ` Jozsef Kadlecsik
  2004-08-26 11:12       ` KOVACS Krisztian
  0 siblings, 1 reply; 55+ messages in thread
From: Jozsef Kadlecsik @ 2004-08-26 10:00 UTC (permalink / raw)
  To: KOVACS Krisztian; +Cc: Harald Welte, netfilter-devel, Netfilter-failover list

On Thu, 19 Aug 2004, KOVACS Krisztian wrote:

> > >      3. There are a few things in the connection tracking code which
> > >         are incompatible with replication "by design". For example,
> > >         the expectfn() function in the expectation structure is such:
> > >         simply, there is no way to replicate a stand-alone function
> > >         pointer which could point to any arbitrary function.
> >
> > Yes, indeed.  we could look up the symbol name in the symbol table and
> > replicate that ;)   Crude hack, but it would work.
>
>   Unfortunately I don't think it would work... AFAIK there is no symbol
> table information in 2.4 kernels and usually these functions are
> declared as static anyway. Moreover, take a look at the H.323 helper, it
> uses this expectfn function to set the helper of the conntrack to an
> unregistered helper structure... I think there is no point in making
> ct_sync overly complicated; these helpers should be fixed instead. (I
> don't know why the H.323 does things this way, but it is completely
> hopeless to replicate things like this.)

We could add the expect function to the ip_conntrack_helper structure and
identify it by the helper name in the update messages. The unregistered
helper in the H.323 conntrack/nat module could be registered with an
invalid, never matching port and let the expect function handle it as
before (because the real port is dynamic). I think that'd be sufficient in
solving the replication problem.
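
A minimal user-space sketch of this idea, not kernel code: the update message carries only the helper name, and the receiving node looks up the expect function through its own helper registry. The `Helper` class, the registry, the message layout and the "h323-dynamic" name are all assumptions made for illustration; `NEVER_MATCHING_PORT` merely stands in for the "invalid, never matching port".

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Assumption: placeholder for the "invalid, never matching port".
NEVER_MATCHING_PORT = 0


@dataclass
class Helper:
    name: str                                   # identifier sent in update messages
    port: int                                   # port the helper is registered on
    expectfn: Optional[Callable[[dict], None]] = None


# Every node registers the same helpers, so a name resolves on both sides.
registry: Dict[str, Helper] = {}


def register_helper(helper: Helper) -> None:
    registry[helper.name] = helper


def encode_expectation(helper: Helper) -> dict:
    # Only the helper name crosses the wire, never a function pointer.
    return {"helper": helper.name}


def apply_expectation(msg: dict, conntrack: dict) -> None:
    # The receiver resolves the name locally and runs its own expectfn.
    helper = registry[msg["helper"]]
    if helper.expectfn is not None:
        helper.expectfn(conntrack)
```

With something like this, a helper registered under a never-matching port would still be reachable by name, so its expect function could run on the slave without replicating any pointer.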

Unfortunately we cannot fix the H.323 protocol. :-)

> > > 	 One more example could be TCP window tracking, I don't think we
> > > 	 have the necessary bandwidth and CPU time to send an update
> > > 	 message after each and every received TCP packet... Any idea
> > > 	 how we could solve these problems?
> >
> > We already do this since the timeout is updated with every packet.  So
> > at this point, I see not much difference.  Jozsef and I agreed some time
> > in the past, that if we don't replicate all the window information, in
> > the event of a slave being propagated to master, the new master should
> > disable windowtracking or switch into a lazy mode.
>
>   Indeed, the timeout is updated with every packet. However, we do not
> generate an update message for IPCT_REFRESH events at the moment. On the
> other hand, we replicate the timeout changes when a state change occurs
> as well, so I don't worry about incorrect (relative) timeout values at
> all.

That's still fine with TCP window tracking. Just as Harald wrote, we can
switch to lazy mode on the slaves. But to be honest, that is the least
verified part of the window tracking code.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary


* Re: [RFC] ct_sync 0.15 (corrected)
  2004-08-26 10:00     ` Jozsef Kadlecsik
@ 2004-08-26 11:12       ` KOVACS Krisztian
  2004-08-26 11:39         ` Jozsef Kadlecsik
  0 siblings, 1 reply; 55+ messages in thread
From: KOVACS Krisztian @ 2004-08-26 11:12 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Harald Welte, netfilter-devel, Netfilter-failover list


  Hi,

On Thu, 2004-08-26 at 12:00, Jozsef Kadlecsik wrote:
> >   Unfortunately I don't think it would work... AFAIK there is no symbol
> > table information in 2.4 kernels and usually these functions are
> > declared as static anyway. Moreover, take a look at the H.323 helper, it
> > uses this expectfn function to set the helper of the conntrack to an
> > unregistered helper structure... I think there is no point in making
> > ct_sync overly complicated; these helpers should be fixed instead. (I
> > don't know why the H.323 does things this way, but it is completely
> > hopeless to replicate things like this.)
> 
> We could add the expect function to the ip_conntrack_helper structure and
> identify it by the helper name in the update messages. The unregistered
> helper in the H.323 conntrack/nat module could be registered with an
> invalid, never matching port and let the expect function handle it as
> before (because the real port is dynamic). I think that'd be sufficient in
> solving the replication problem.

  Sounds good. This way we could replicate the expectfn function along
with the conntrack helper structure, and the unregistered helpers could
be handled as well. This might be a bit more complicated than the
current solution, but if we have to do some evil magic to handle
H.323, we should do that in a ct_sync compatible manner if possible...

> Unfortunately we cannot fix the H.323 protocol. :-)

  Of course, I meant to write 'I don't know why the H.323 _helper_ does
things this way' but made some mistakes... :(

> >   Indeed, the timeout is updated with every packet. However, we do not
> > generate an update message for IPCT_REFRESH events at the moment. On the
> > other hand, we replicate the timeout changes when a state change occurs
> > as well, so I don't worry about incorrect (relative) timeout values at
> > all.
> 
> That's still fine with TCP window tracking. Just as Harald wrote, we can
> switch to lazy mode on the slaves. But to be honest, that is the least
> verified part of the window tracking code.

  This would be an easy solution.

-- 
 Regards,
   Krisztian KOVACS


* Re: [RFC] ct_sync 0.15 (corrected)
  2004-08-26 11:12       ` KOVACS Krisztian
@ 2004-08-26 11:39         ` Jozsef Kadlecsik
  2004-08-26 16:14           ` [nf-failover] " KOVACS Krisztian
  0 siblings, 1 reply; 55+ messages in thread
From: Jozsef Kadlecsik @ 2004-08-26 11:39 UTC (permalink / raw)
  To: KOVACS Krisztian; +Cc: Harald Welte, netfilter-devel, Netfilter-failover list

Hi,

On Thu, 26 Aug 2004, KOVACS Krisztian wrote:

> > We could add the expect function to the ip_conntrack_helper structure and
> > identify it by the helper name in the update messages. The unregistered
> > helper in the H.323 conntrack/nat module could be registered with an
> > invalid, never matching port and let the expect function handle it as
> > before (because the real port is dynamic). I think that'd be sufficient in
> > solving the replication problem.
>
>   Sounds good. This way the could replicate the expectfn function along
> with the conntrack helper structure, and the unregistered helpers could
> be handled as well. Although this might be a bit more complicated than
> the current solution, but if we have to do some evil magic to handle
> H.323, we should do that in a ct_sync compatible manner if possible...

Once the so far unregistered H.323 helper is registered, that would
be fully ct_sync compatible, without the need to modify anything in
ct_sync. The core/ct_sync should be modified for the expectfn only, and
that's a general requirement, independent of the H.323 helper.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-08-26 11:39         ` Jozsef Kadlecsik
@ 2004-08-26 16:14           ` KOVACS Krisztian
  0 siblings, 0 replies; 55+ messages in thread
From: KOVACS Krisztian @ 2004-08-26 16:14 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Harald Welte, netfilter-devel, Netfilter-failover list


  Hi,

On Thu, Aug 26, 2004 at 01:39:33PM +0200, Jozsef Kadlecsik wrote:
> Because the so far unregistered H.323 helper were registered, that would
> be fully ct_sync compatible, without the need to modify anyting in
> ct_sync. The core/ct_sync should be modified for the expectn only and
> that's a general requirement, independent of the H.323 helper.

  In fact, ct_sync would not require any modifications, so I'd prefer this
solution.

-- 
 KOVACS Krisztian


* Re: [RFC] ct_sync 0.15 (corrected)
  2004-08-13 14:26 [RFC] ct_sync 0.15 (corrected) KOVACS Krisztian
  2004-08-19 11:06 ` Harald Welte
  2004-08-22  0:40 ` Patrick McHardy
@ 2004-09-02  5:10 ` Willy Tarreau
  2004-09-02 12:39   ` KOVACS Krisztian
  2004-09-24  2:42 ` jamal
  3 siblings, 1 reply; 55+ messages in thread
From: Willy Tarreau @ 2004-09-02  5:10 UTC (permalink / raw)
  To: KOVACS Krisztian; +Cc: netfilter-devel, jamal, Netfilter-failover list

Hi,

>	One more
>         example could be TCP window tracking, I don't think we have
>         the necessary bandwidth and CPU time to send an update message
>         after each and every received TCP packet... Any idea how we
>         could solve these problems?

I don't really agree with the bandwidth argument: a small TCP packet with
only an ACK or a control flag is 40 bytes. This is 12 bytes more than the
smallest UDP packet, so if you can update a connection with at most 12 bytes,
you can use links of the same nature. BTW, I don't think you should send one
packet per connection. You should queue updates into a list, and build the
update packet from this list. This way, you eliminate the IP+UDP header (28
bytes) for all updates except the first one, which means you then have 40
bytes to update a connection without using more bandwidth. I'm not saying
this is enough for every case, but I think that depending on the type
of message (creation, update, destruction), we might do things with this.
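
The arithmetic above can be written out as a small worked example. It assumes IPv4 and TCP headers without options (20 bytes each) and an 8-byte UDP header; the function name is made up for illustration.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
TCP_HEADER = 20   # bytes, TCP header without options
UDP_HEADER = 8    # bytes


def sync_payload_budget(n_updates: int) -> int:
    """Payload bytes one UDP sync packet may carry for n_updates while
    staying no larger than the n bare TCP packets that triggered them."""
    observed = n_updates * (IPV4_HEADER + TCP_HEADER)  # 40 bytes per bare ACK
    overhead = IPV4_HEADER + UDP_HEADER                # 28 bytes, paid once
    return observed - overhead
```

For a single update the budget is 40 - 28 = 12 bytes; every further update batched into the same packet adds a full 40 bytes, because the IP+UDP header is already paid for.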

If you don't want to synchronize TCP windows, you might also put the slave
into a "lazy mode" for existing connections when it becomes master. This is a
bit dirty, but might be an acceptable trade-off.

Regards,
Willy


* Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-02  5:10 ` Willy Tarreau
@ 2004-09-02 12:39   ` KOVACS Krisztian
  0 siblings, 0 replies; 55+ messages in thread
From: KOVACS Krisztian @ 2004-09-02 12:39 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netfilter-devel, jamal, Netfilter-failover list


  Hi,

On Thu, 2004-09-02 at 07:10, Willy Tarreau wrote:
> >	One more
> >         example could be TCP window tracking, I don't think we have
> >         the necessary bandwidth and CPU time to send an update message
> >         after each and every received TCP packet... Any idea how we
> >         could solve these problems?
> 
> I don't really agree on the bandwidth argument : a small TCP packet with
> only an ACK or a control flag is 40 bytes. This is 12 bytes more than the
> smallest UDP packet, so if you can update a connection with at most 12 bytes,
> you can use links of the same nature. BTW, I don't think you should send one
> packet per connection. You should queue updates into a list, and build the
> update packet from this list. This way, you eliminate the IP+UDP header (28
> bytes) for all updates except the first one, which means you then have 40
> bytes to update a connection without using more bandwidth. I'm not saying
> this is much enough for every case, but I think that depending on the type
> of message (creation, update, destruction), we might do things with this.

  We already have such a queuing facility: a packet is sent only on
timeout (2s) or when it is full. The problem is the size of the
messages. At the moment, ct_sync always sends completely self-contained
updates: that is, a single update message contains every bit of
information about a conntrack entry. As a result, the size of an
update message is 240 bytes (+ 4 bytes for the message header), so
you can have about five update messages per packet, which is
not too much... (There are plans to implement a more fine-grained update
mechanism, but I don't know who will have the time to implement that.)
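
A rough user-space model of that queuing rule (flush when full or after 2s), not the ct_sync code itself. The 1500-byte MTU, the per-packet header parameter and the choice to measure the timeout from the first queued update are assumptions; only the 240 + 4 byte message size and the 2s timeout come from the text above.

```python
UPDATE_SIZE = 240     # bytes per self-contained update (from the message above)
MSG_HEADER = 4        # bytes of message header (from the message above)
IP_UDP_HEADER = 28    # IPv4 + UDP headers without options


def messages_per_packet(mtu: int = 1500, packet_header: int = 0) -> int:
    # packet_header is hypothetical: the thread does not give its size.
    room = mtu - IP_UDP_HEADER - packet_header
    return room // (UPDATE_SIZE + MSG_HEADER)


class SyncQueue:
    def __init__(self, capacity: int, timeout: float = 2.0):
        self.capacity = capacity
        self.timeout = timeout
        self.pending = []
        self.first_queued_at = None

    def add(self, update, now: float):
        """Queue an update; return the batch if the packet became full."""
        if not self.pending:
            self.first_queued_at = now
        self.pending.append(update)
        if len(self.pending) >= self.capacity:
            return self._flush()
        return None

    def poll(self, now: float):
        """Return the batch if the oldest queued update has timed out."""
        if self.pending and now - self.first_queued_at >= self.timeout:
            return self._flush()
        return None

    def _flush(self):
        batch, self.pending = self.pending, []
        self.first_queued_at = None
        return batch
```

Under these assumptions the bound is six 244-byte messages with no extra per-packet header, and five once a 16-byte packet header is assumed, which is in the same range as the "about five" quoted above.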

> If you don't want to synchronize TCP windows, you might also turn the slave
> in a "lazy mode" for existing connections when it becomes master. This is a
> bit dirty, but might be an acceptable trade-off.

  Yes, this is exactly what Jozsef suggested.

-- 
 Regards,
   Krisztian KOVACS


* Re: [RFC] ct_sync 0.15 (corrected)
  2004-08-13 14:26 [RFC] ct_sync 0.15 (corrected) KOVACS Krisztian
                   ` (2 preceding siblings ...)
  2004-09-02  5:10 ` Willy Tarreau
@ 2004-09-24  2:42 ` jamal
  2004-09-25  7:52   ` [nf-failover] " Harald Welte
  3 siblings, 1 reply; 55+ messages in thread
From: jamal @ 2004-09-24  2:42 UTC (permalink / raw)
  To: KOVACS Krisztian; +Cc: netfilter-devel, Netfilter-failover list

Hi Krisztian,

I just glanced over your code (30 sec scan) and your state machine
doesn't allow for active/active (i.e. two masters).
I haven't actually run it - can you confirm this is impossible?
If ct_sync was blind, i.e. it just did what it was told ("become master" or
"become slave") regardless of who else is master, then it would be more
usable - leave policy to whatever tells it to switch.

cheers,
jamal


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-24  2:42 ` jamal
@ 2004-09-25  7:52   ` Harald Welte
  2004-09-27 13:07     ` jamal
  0 siblings, 1 reply; 55+ messages in thread
From: Harald Welte @ 2004-09-25  7:52 UTC (permalink / raw)
  To: jamal; +Cc: netfilter-devel, Netfilter-failover list, KOVACS Krisztian


On Thu, Sep 23, 2004 at 10:42:19PM -0400, jamal wrote:
> Hi Krisztian,
> 
> I just glanced over your code (30 sec scan) and your state machine
> doesnt allow for active/active (i.e two masters).

Yes, this is not a supported mode of operation in this first
implementation.

> I havent actually run it - can you confirm this is impossible? 
> if ct_sync was blind i.e it just did what it was told "become master" or
> "become slave" regardless of who else is master, then it would be more
> usable - leave policy to whatever tells it to switch.

Well, it does exactly this, with one additional safeguard: a master will
be downgraded to slave as soon as another master announces itself. This
guards against an invalid mode of operation.
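
The behaviour described here can be modelled in a few lines. This is only a sketch under assumptions: the class and method names are invented, and the real ct_sync state machine may have more states and transitions than shown.

```python
from enum import Enum


class Role(Enum):
    SLAVE = "slave"
    MASTER = "master"


class SyncNode:
    def __init__(self, node_id: str, role: Role = Role.SLAVE):
        self.node_id = node_id
        self.role = role

    def set_role(self, role: Role) -> None:
        # "Blind" part: do whatever the external cluster manager says.
        self.role = role

    def on_master_announce(self, sender_id: str) -> None:
        # Safeguard: a master steps down when another master announces itself.
        if sender_id != self.node_id and self.role is Role.MASTER:
            self.role = Role.SLAVE
```

Removing the body of `on_master_announce` gives the fully blind variant asked for in the quoted question.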

> cheers,
> jamal

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-25  7:52   ` [nf-failover] " Harald Welte
@ 2004-09-27 13:07     ` jamal
  2004-09-27 13:30       ` KOVACS Krisztian
  2004-09-27 13:39       ` Harald Welte
  0 siblings, 2 replies; 55+ messages in thread
From: jamal @ 2004-09-27 13:07 UTC (permalink / raw)
  To: Harald Welte; +Cc: netfilter-devel, Netfilter-failover list, KOVACS Krisztian

Hi Harald,

On Sat, 2004-09-25 at 03:52, Harald Welte wrote:

> > I havent actually run it - can you confirm this is impossible? 
> > if ct_sync was blind i.e it just did what it was told "become master" or
> > "become slave" regardless of who else is master, then it would be more
> > usable - leave policy to whatever tells it to switch.
> 
> well it exactly does this, with an additional security:  A master will
> be downgraded to slave as soon as another master announces itself.  This
> is a security guard against invalid mode of operation.

I think it would be better to separate the election process (who is
master) from the syncing code. For some reason I thought this separation
was there (and that all you had to do was write to some /proc entry).
I.e. if VRRP is the code that makes the decision that it wants you to be
the master, that's how you become master. If someothericandohabetter (e.g.
ForCES) protocol wants you to be the master, that's how you become the
master. There is no point in inventing a new HA scheme.
And if you do, it should probably be in user space (there are no
performance issues with it), i.e. it's such a protocol (in user space) that
should tell your syncing code to send syncs or not.
Unless I missed something fundamental.

cheers,
jamal


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-27 13:07     ` jamal
@ 2004-09-27 13:30       ` KOVACS Krisztian
  2004-09-27 13:39       ` Harald Welte
  1 sibling, 0 replies; 55+ messages in thread
From: KOVACS Krisztian @ 2004-09-27 13:30 UTC (permalink / raw)
  To: hadi; +Cc: Harald Welte, netfilter-devel, Netfilter-failover list

  Hi,

On Mon, 2004-09-27 at 15:07, jamal wrote:
> > well it exactly does this, with an additional security:  A master will
> > be downgraded to slave as soon as another master announces itself.  This
> > is a security guard against invalid mode of operation.
> 
> I think it would be better to separate the election process (who is
> master) from the syncing code. For some reason i thought this separation
> was there (and that all you had to do was bag some /proc entry). 
> i.e if VRRP is the code that makes the decision that it wants you to be
> the master, thats how you become master. If someothericandohabetter (eg
> forCES) protocol wants you to be the master, thats how you become the
> master. There is no point in inventing a new HA scheme.
> And if you do it should probably be in user space (there is no
> perfomance issues with it) i.e its such a protocol (in user space) that 
> should tell your syncing code to send syncs or not.
> Unless i missed something fundamental.

  The election process is completely independent, as Harald already
mentioned. The procfs interface is provided so that ct_sync can be used
with any other cluster manager/failover daemon.

  However, ct_sync is not capable of load balancing at the moment, and
not just because the protocol has some things which are single-master
specific. The main problem is with NAT and preserving the uniqueness of
tuples in the whole cluster, and unfortunately this would make a lot of
things much more complicated. So, even if the protocol were
completely multi-master compatible, ct_sync would only be capable of
single-master operation.

  You're right that the protocol itself was designed with failover in
mind, and it won't support load balancing clusters without modification.
This is mainly because of the sequence numbers, and could be easily
corrected by maintaining a per-node seqno state on each node. The
safeguard against multiple masters was implemented just to make
sure that we won't have multiple masters even if some administrator is
experimenting with ct_sync and the proc interface.
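
To illustrate what "per-node seqno state" could look like, here is a hedged sketch: each receiver remembers the last sequence number seen from every sender instead of one cluster-wide value. The class name and the acceptance rule are assumptions, and sequence number wraparound is deliberately ignored.

```python
class SeqTracker:
    """Tracks the last accepted sequence number per sending node."""

    def __init__(self):
        self.last_seen = {}  # node id -> last accepted seqno

    def accept(self, node_id: str, seqno: int) -> bool:
        last = self.last_seen.get(node_id)
        if last is not None and seqno <= last:
            return False  # duplicate or stale packet from this sender
        self.last_seen[node_id] = seqno
        return True
```

With a single shared counter, packet 1 from a second master would look like a duplicate of packet 1 from the first; keyed by node, both are accepted.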

  I don't consider these issues fundamental flaws; we tried to make
the election process as independent from ct_sync as possible.

-- 
 Regards,
   Krisztian KOVACS


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-27 13:07     ` jamal
  2004-09-27 13:30       ` KOVACS Krisztian
@ 2004-09-27 13:39       ` Harald Welte
  2004-09-28  2:41         ` jamal
  1 sibling, 1 reply; 55+ messages in thread
From: Harald Welte @ 2004-09-27 13:39 UTC (permalink / raw)
  To: jamal; +Cc: netfilter-devel, Netfilter-failover list, KOVACS Krisztian


On Mon, Sep 27, 2004 at 09:07:53AM -0400, jamal wrote:
> I think it would be better to separate the election process (who is
> master) from the syncing code. For some reason i thought this separation
> was there (and that all you had to do was bag some /proc entry). 
> i.e if VRRP is the code that makes the decision that it wants you to be
> the master, thats how you become master. If someothericandohabetter (eg
> forCES) protocol wants you to be the master, thats how you become the
> master. There is no point in inventing a new HA scheme.
> And if you do it should probably be in user space (there is no
> perfomance issues with it) i.e its such a protocol (in user space) that 
> should tell your syncing code to send syncs or not.
> Unless i missed something fundamental.

I totally agree with you, jamal.  And in fact this is exactly what we
have.  You tell one box it is master, and it becomes master.   You tell
a box it is slave, and it becomes slave.

There is just a minor addition in one case, where we want to safeguard
against a (currently) invalid mode of operation.  As soon as ct_sync
supports multiple masters, this safeguard will certainly be removed.

But for the current code, unless somebody shows me that it severely
limits some use of ct_sync, or it causes practical problems, I don't see
why we should remove this safeguard.

> cheers,
> jamal

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-27 13:39       ` Harald Welte
@ 2004-09-28  2:41         ` jamal
  2004-09-28  6:46           ` Henrik Nordstrom
  0 siblings, 1 reply; 55+ messages in thread
From: jamal @ 2004-09-28  2:41 UTC (permalink / raw)
  To: Harald Welte; +Cc: netfilter-devel, Netfilter-failover list, KOVACS Krisztian

Harald,

On Mon, 2004-09-27 at 09:39, Harald Welte wrote:

> I totally agree with you, jamal.  And in fact this is exactly what we
> have.  You tell one box it is master, and it becomes master.   You tell
> a box it is slave, and it becomes slave.
>
> There is just a minor addition in one case, where we want to safeguard
> against a (currently) invalid mode of operation.  As soon as ct_sync
> supports multiple master, this safeguard will certainly be removed.

So if I understood correctly, the issue is as described by Krisztian:

On Mon, 2004-09-27 at 09:30, KOVACS Krisztian wrote:
> The main problem is with NAT and preserving the uniqueness of
> tuples in the whole cluster, and unfortunately this would make a lot of
> things much more complicated. So, even if the protocol would be
> completely multi-master compatible ct_sync would be capable of
> single-master operation.

If you have two machines A and B, assuming they are symmetric (exactly
the same internal and external IPs), then I should be able to send state from
A->B and B->A and have both B and A updated (if such state doesn't exist).

With the above, if I decide I want to have two nodes as master, they both
generate and accept state update messages.

> But for the current code, unless somebody shows to me that it severely
> limits some use of ct_sync, or it causes practical problems, I don't see
> why we should remove this safeguard.

As I see it you need 3 states:
master - accepts and generates sync messages
slave - only accepts syncs
init - unknown; does neither
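
The three states listed above can be expressed as a small capability table. This is a sketch of the proposal only, with invented names; it does not describe how ct_sync is actually implemented.

```python
from enum import Enum


class Role(Enum):
    INIT = "init"
    SLAVE = "slave"
    MASTER = "master"


# role -> (accepts sync messages, generates sync messages)
CAPABILITIES = {
    Role.MASTER: (True, True),
    Role.SLAVE: (True, False),
    Role.INIT: (False, False),
}


def accepts_sync(role: Role) -> bool:
    return CAPABILITIES[role][0]


def generates_sync(role: Role) -> bool:
    return CAPABILITIES[role][1]


def sync_flows(sender: Role, receiver: Role) -> bool:
    # State moves from sender to receiver only if both sides allow it.
    return generates_sync(sender) and accepts_sync(receiver)
```

In this model two masters exchange state in both directions, which is the master/master case under discussion.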

If you didn't have this safeguard then I should be able to achieve
master/master on two nodes (even if for starters I assume a symmetric
setup). ct_sync in itself should not attempt to be too smart and have a
built-in protocol IMO - there is no point in reinventing the wheel;
people have spent years researching HA protocols, so it is a good idea
to just use those.

cheers,
jamal


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-28  2:41         ` jamal
@ 2004-09-28  6:46           ` Henrik Nordstrom
  2004-09-28 10:56             ` jamal
  2004-09-28 11:58             ` Tobias DiPasquale
  0 siblings, 2 replies; 55+ messages in thread
From: Henrik Nordstrom @ 2004-09-28  6:46 UTC (permalink / raw)
  To: jamal
  Cc: Harald Welte, netfilter-devel, Netfilter-failover list,
	KOVACS Krisztian

On Tue, 27 Sep 2004, jamal wrote:

> On Mon, 2004-09-27 at 09:30, KOVACS Krisztian wrote:
>> The main problem is with NAT and preserving the uniqueness of
>> tuples in the whole cluster, and unfortunately this would make a lot of
>> things much more complicated. So, even if the protocol would be
>> completely multi-master compatible ct_sync would be capable of
>> single-master operation.
>
> If you have two machines A and B, assuming they are symetric (exactly
> same internal and external IPs) then i should be able to send state from
> A->B and B->A and have both B and A updated (if such state doesnt exist).

No, this is about a different issue entirely.

Let's assume you have two Active-Active gateways G and H, two clients A and
B and one server S. On the gateways, NAT is used to masquerade all traffic
to a single external IP address.

Due to the Active-Active setup traffic from A goes via the gateway G and 
traffic from B goes via H.

Now you have a SYN from A,31285 to S,80 and also a SYN sent by B,31285 to 
S,80. You then end up with two identical NAT assignments and the two 
connections will conflict with each other.

> With above if i decide i want to have two nodes as master, they both
> generate and accept state update messages.

Which in itself is not an issue, but the issue is how to ensure these
updates do not conflict with each other as in the example above.

The active-active or active-backup aspect of the synchronization protocol
is trivial. How to ensure there won't be serious session conflicts between
the connection information of the two gateways is the tricky part in order
to be able to provide active-active configurations.

Regards
Henrik


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-28  6:46           ` Henrik Nordstrom
@ 2004-09-28 10:56             ` jamal
  2004-09-28 12:24               ` KOVACS Krisztian
  2004-09-28 11:58             ` Tobias DiPasquale
  1 sibling, 1 reply; 55+ messages in thread
From: jamal @ 2004-09-28 10:56 UTC (permalink / raw)
  To: Henrik Nordstrom
  Cc: Harald Welte, netfilter-devel, Netfilter-failover list,
	KOVACS Krisztian

On Tue, 2004-09-28 at 02:46, Henrik Nordstrom wrote:

> Lets assume you have two Active-Active gateways G and H, two clients A and 
> B and one server S. On the gateway NAT is used to masquerade all traffic 
> to a single external IP address.
> 
> Due to the Active-Active setup traffic from A goes via the gateway G and 
> traffic from B goes via H.
> 
> Now you have a SYN from A,31285 to S,80 and also a SYN sent by B,31285 to 
> S,80. You then end up with two identical NAT assignments and the two 
> connections will conflict with each other.

If we even look at the 5-tuple {srcIP, dstIP, proto, srcport, dstport} we
already have a distinction, no?
I.e. in the example you provide, srcIP would be different.

I can see an issue if those 5-tuples match and you have to find
something else to distinguish them, since Linux conntracking doesn't keep
track of TCP sequence numbers and window dilations. If it did, I don't see
why this would be a problem. I think I am having a hard time visualizing
when you would even need to kick in sequence number checks.

> > With above if i decide i want to have two nodes as master, they both
> > generate and accept state update messages.
> 
> Which in itself is not an issue, but the issue is how to ensure these 
> updates does not conflict with each other as in the example above.
> 
> The active-active or active-backup aspect of the syncronization protocol 
> is trivial. How to ensure there won't be serious session conflicts between 
> the connection information of the two gateways is the tricky part in order 
> to be able to provide active-active configurations.

See my comment above.
I don't think I see a serious issue of conflict. I may be missing
something of course. At least the srcIP may end up being a tiebreaker.
To get exactly the same 5-tuple from the same physical machine for a
different flow is impossible, I would think.
Again, I may be missing something.

cheers,
jamal


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-28  6:46           ` Henrik Nordstrom
  2004-09-28 10:56             ` jamal
@ 2004-09-28 11:58             ` Tobias DiPasquale
  2004-09-28 12:11               ` KOVACS Krisztian
  2004-09-28 12:31               ` Henrik Nordstrom
  1 sibling, 2 replies; 55+ messages in thread
From: Tobias DiPasquale @ 2004-09-28 11:58 UTC (permalink / raw)
  To: Henrik Nordstrom; +Cc: nf-devel, netfilter-failover

On Tue, 28 Sep 2004 08:46:25 +0200 (CEST), Henrik Nordstrom
<hno@marasystems.com> wrote:
> No, this is about a different issue entirely.
> 
> Lets assume you have two Active-Active gateways G and H, two clients A and
> B and one server S. On the gateway NAT is used to masquerade all traffic
> to a single external IP address.
> 
> Due to the Active-Active setup traffic from A goes via the gateway G and
> traffic from B goes via H.
> 
> Now you have a SYN from A,31285 to S,80 and also a SYN sent by B,31285 to
> S,80. You then end up with two identical NAT assignments and the two
> connections will conflict with each other.

Why use NAT at all for active-active? It's pretty slow in comparison to
the shared MAC/IP scheme described at UltraMonkey.org:

http://www.ultramonkey.org/papers/active_active/active_active.shtml

Am I missing something? Is NAT required for some reason?

-- 
[ Tobias DiPasquale ]
0x636f6465736c696e67657240676d61696c2e636f6d


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-28 11:58             ` Tobias DiPasquale
@ 2004-09-28 12:11               ` KOVACS Krisztian
  2004-09-28 12:31               ` Henrik Nordstrom
  1 sibling, 0 replies; 55+ messages in thread
From: KOVACS Krisztian @ 2004-09-28 12:11 UTC (permalink / raw)
  To: Tobias DiPasquale; +Cc: nf-devel, Netfilter-failover list, Henrik Nordstrom


  Hi,

On Tue, 2004-09-28 at 13:58, Tobias DiPasquale wrote:
> On Tue, 28 Sep 2004 08:46:25 +0200 (CEST), Henrik Nordstrom
> <hno@marasystems.com> wrote:
> > No, this is about a different issue entirely.
> > 
> > Lets assume you have two Active-Active gateways G and H, two clients A and
> > B and one server S. On the gateway NAT is used to masquerade all traffic
> > to a single external IP address.
> > 
> > Due to the Active-Active setup traffic from A goes via the gateway G and
> > traffic from B goes via H.
> > 
> > Now you have a SYN from A,31285 to S,80 and also a SYN sent by B,31285 to
> > S,80. You then end up with two identical NAT assignments and the two
> > connections will conflict with each other.
> 
> Why use NAT at all for active-active? Its pretty slow in comparison to
> the shared MAC/IP schema delineated at UltraMonkey.org:
> 
> http://www.ultramonkey.org/papers/active_active/active_active.shtml
> 
> Am I missing something? Is NAT required for some reason?

  Ok, but this is not redirector failover, ct_sync is simply a general
purpose conntrack state replication solution. And as such, it should be
able to handle NAT-related conntrack data as well. If you have a
multi-master (load balancing) packet filter cluster it still has to be
able to do anything you can do with a single node.

-- 
 Regards,
   Krisztian KOVACS


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-28 10:56             ` jamal
@ 2004-09-28 12:24               ` KOVACS Krisztian
  2004-09-28 12:35                 ` Henrik Nordstrom
  0 siblings, 1 reply; 55+ messages in thread
From: KOVACS Krisztian @ 2004-09-28 12:24 UTC (permalink / raw)
  To: hadi
  Cc: Harald Welte, netfilter-devel, Netfilter-failover list,
	Henrik Nordstrom

  Hi,

On Tue, 2004-09-28 at 12:56, jamal wrote:
> On Tue, 2004-09-28 at 02:46, Henrik Nordstrom wrote:
> 
> > Lets assume you have two Active-Active gateways G and H, two clients A and 
> > B and one server S. On the gateway NAT is used to masquerade all traffic 
> > to a single external IP address.
> > 
> > Due to the Active-Active setup traffic from A goes via the gateway G and 
> > traffic from B goes via H.
> > 
> > Now you have a SYN from A,31285 to S,80 and also a SYN sent by B,31285 to 
> > S,80. You then end up with two identical NAT assignments and the two 
> > connections will conflict with each other.
> 
> if we even look at the 5 tuples {srcIP,DstIP, proto, srcport,dstport} we
> already have a distinction, no?
> i.e in the example you provide srcIP would be different.
> 
> I can see an issue if those 5 tuples match and you have to find
> something else to distinguish them since Linux conntrack doesn't keep
> track of TCP sequence numbers and window dilations. If it did I don't see
> why this would be a problem. I think i am having a hard time visualizing
> when you would even need to kick in sequence number checks,

  Not necessarily. You cannot (easily) decide which conntrack the reply
packets belong to. So, in the above scenario the following is perfectly
possible:

A -------- G\
             --------S
B -------- H/

Let's suppose the SYN packets from A and B arrive at G and H
simultaneously, so that neither G nor H knows anything about the other
connection yet. When the NAT core searches for a suitable new source
address for the connection, it is possible that each node will choose
exactly the same new source IP:port pair. (Because they obviously do
uniqueness checks based only on their own conntrack state table.) And if
the two connections were destined to the same IP:port, the reply packets
for both connections will look exactly the same. In case of TCP you
could probably make guesses based on sequence numbers and such, but what
would you do in case of other protocols?

  The problem could be circumvented if we statically partitioned the
address space between the nodes in the cluster. Unfortunately this is
not as simple as it sounds, since it is possible to have untranslated
connections using the possibly clashing tuples as well... (Maybe we could
apply implicit SNAT translations in this case?)
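
  A minimal sketch of the clash described above, not taken from the
ct_sync or NAT code. It assumes a toy model in which each gateway keeps
its own set of used reply tuples and tries to preserve the original
source port; the addresses and the Gateway class are hypothetical.

```python
# Toy model: two gateways masquerade to the same external IP and each
# checks tuple uniqueness only against its OWN conntrack table.
EXTERNAL_IP = "192.0.2.1"  # assumed shared masquerade address


class Gateway:
    def __init__(self, name):
        self.name = name
        self.used = set()  # reply tuples known to this node only

    def snat(self, src_ip, src_port, dst_ip, dst_port):
        """Pick a translated source, keeping the original port if it looks free."""
        port = src_port
        while (EXTERNAL_IP, port, dst_ip, dst_port) in self.used:
            port += 1
        reply_tuple = (EXTERNAL_IP, port, dst_ip, dst_port)
        self.used.add(reply_tuple)
        return reply_tuple


g, h = Gateway("G"), Gateway("H")
# A and B use the same source port towards the same server S:80.
tuple_g = g.snat("10.0.0.1", 31285, "198.51.100.80", 80)
tuple_h = h.snat("10.0.0.2", 31285, "198.51.100.80", 80)

# Both nodes picked the same translated tuple, so replies from S are ambiguous.
clash = tuple_g == tuple_h
print("clash:", clash)
```

  Neither node can detect the problem locally, because the other node's
entry has not been replicated yet when the choice is made.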

-- 
 Regards,
   Krisztian KOVACS

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-28 11:58             ` Tobias DiPasquale
  2004-09-28 12:11               ` KOVACS Krisztian
@ 2004-09-28 12:31               ` Henrik Nordstrom
  1 sibling, 0 replies; 55+ messages in thread
From: Henrik Nordstrom @ 2004-09-28 12:31 UTC (permalink / raw)
  To: Tobias DiPasquale; +Cc: nf-devel, netfilter-failover

On Tue, 28 Sep 2004, Tobias DiPasquale wrote:

> Why use NAT at all for active-active? Its pretty slow in comparison to
> the shared MAC/IP schema delineated at UltraMonkey.org:

We are talking firewalls here, not loadbalancers.

NAT of the traffic forwarded, not NAT to reach the box.

> Am I missing something? Is NAT required for some reason?

By the firewall policy implemented by the firewall.

Regards
Henrik

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-28 12:24               ` KOVACS Krisztian
@ 2004-09-28 12:35                 ` Henrik Nordstrom
  2004-09-28 12:57                   ` KOVACS Krisztian
  0 siblings, 1 reply; 55+ messages in thread
From: Henrik Nordstrom @ 2004-09-28 12:35 UTC (permalink / raw)
  To: KOVACS Krisztian
  Cc: Harald Welte, netfilter-devel, hadi, Netfilter-failover list

On Tue, 28 Sep 2004, KOVACS Krisztian wrote:

> The problem could be circumvented if we statically partitioned the
> address space between the nodes in the cluster. Unfortunately this is
> not as simple as it sounds, since it is possible to have untranslated
> connections using the possibly clashing tuples as well... (Maybe we could
> apply implicit SNAT translations in this case?)

I think for the active-active case the only viable setup is to enforce 
strict address separation, with the addresses used for NAT not used for 
anything else, and unique per firewall in the active-active cluster.

This is not as bad as it sounds as the traffic needs to be partitioned as 
well. We certainly do not want to see asymmetric flows in conntrack where 
traffic goes out via one gateway and returns on another.

Regards
Henrik

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-28 12:35                 ` Henrik Nordstrom
@ 2004-09-28 12:57                   ` KOVACS Krisztian
  2004-09-28 13:14                     ` jamal
  2004-09-28 13:58                     ` Henrik Nordstrom
  0 siblings, 2 replies; 55+ messages in thread
From: KOVACS Krisztian @ 2004-09-28 12:57 UTC (permalink / raw)
  To: Henrik Nordstrom
  Cc: Harald Welte, netfilter-devel, hadi, Netfilter-failover list


  Hi,

On Tue, 2004-09-28 at 14:35, Henrik Nordstrom wrote:
> > The problem could be circumvented if we statically partitioned the
> > address space between the nodes in the cluster. Unfortunately this is
> > not as simple as it sounds, since it is possible to have untranslated
> > connections using the possibly clashing tuples as well... (Maybe we could
> > apply implicit SNAT translations in this case?)
> 
> I think for the active-active case the only viable setup is to enforce 
> strict address separation, with the addresses used for NAT not used for 
> anything else, and unique per firewall in the active-active cluster.
> 
> This is not as bad as it sounds as the traffic needs to be partitioned as 
> well. We certainly do not want to see asymmetric flows in conntrack where 
> traffic goes out via one gateway and returns on another.

  There are other solutions for that problem, for example Harald's
ClusterIP code. If we could integrate that with ct_sync we would be able
to do multi-master packet filter clusters without any load balancers
before the cluster. If the NAT core would be integrated with ClusterIP's
hash to avoid conntrack clashes we could do this without statically
assigning different NAT addresses to each node.

-- 
 Regards,
   Krisztian KOVACS

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-28 12:57                   ` KOVACS Krisztian
@ 2004-09-28 13:14                     ` jamal
       [not found]                       ` <1096379957.1026.5.camel@jzny.localdomain>
  2004-09-28 13:58                     ` Henrik Nordstrom
  1 sibling, 1 reply; 55+ messages in thread
From: jamal @ 2004-09-28 13:14 UTC (permalink / raw)
  To: KOVACS Krisztian
  Cc: Harald Welte, netfilter-devel, Netfilter-failover list,
	Henrik Nordstrom


BTW, thanks to both of you - I got what the challenge is now.

For some reason - I am still missing pieces of the
1) --> A sends update for IPx:portY
2) --> B updates its state with new pair.
3) --> B generates update for IPx:portz
4) --> A updates its state with new pair.

in #2 above B reserves that space and never uses it (same in #3 for A).
In other words, when B generates #3, it ensures no conflict by
definition.
You may need to synchronize and generate a "conflict detected" flag in
updates; i would suspect very little conflict though.

cheers,
jamal

On Tue, 2004-09-28 at 08:57, KOVACS Krisztian wrote:
>   Hi,
> 
> On Tue, 2004-09-28 at 14:35, Henrik Nordstrom wrote:
> > > The problem could be circumvented if we statically partitioned the
> > > address space between the nodes in the cluster. Unfortunately this is
> > > not as simple as it sounds, since it is possible to have untranslated
> > > connections using the possibly clashing tuples as well... (Maybe we could
> > > apply implicit SNAT translations in this case?)
> > 
> > I think for the active-active case the only viable setup is to enforce 
> > strict address separation, with the addresses used for NAT not used for 
> > anything else, and unique per firewall in the active-active cluster.
> > 
> > This is not as bad as it sounds as the traffic needs to be partitioned as 
> > well. We certainly do not want to see asymmetric flows in conntrack where 
> > traffic goes out via one gateway and returns on another.
> 
>   There are other solutions for that problem, for example Harald's
> ClusterIP code. If we could integrate that with ct_sync we would be able
> to do multi-master packet filter clusters without any load balancers
> before the cluster. If the NAT core would be integrated with ClusterIP's
> hash to avoid conntrack clashes we could do this without statically
> assigning different NAT addresses to each node.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-28 12:57                   ` KOVACS Krisztian
  2004-09-28 13:14                     ` jamal
@ 2004-09-28 13:58                     ` Henrik Nordstrom
  2004-09-28 14:24                       ` Tobias DiPasquale
  1 sibling, 1 reply; 55+ messages in thread
From: Henrik Nordstrom @ 2004-09-28 13:58 UTC (permalink / raw)
  To: KOVACS Krisztian
  Cc: Harald Welte, netfilter-devel, hadi, Netfilter-failover list

On Tue, 28 Sep 2004, KOVACS Krisztian wrote:

> There are other solutions for that problem, for example Harald's 
> ClusterIP code. If we could integrate that with ct_sync we would be able 
> to do multi-master packet filter clusters without any load balancers 
> before the cluster. If the NAT core would be integrated with ClusterIP's 
> hash to avoid conntrack clashes we could do this without statically 
> assigning different NAT addresses to each node.

Any ideas on how would this work?

Lets reason around the common MASQUERADE case where an internal network 
needs to be SNAT:ed when going out to the Internet.

Regards
Henrik

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-28 13:58                     ` Henrik Nordstrom
@ 2004-09-28 14:24                       ` Tobias DiPasquale
  0 siblings, 0 replies; 55+ messages in thread
From: Tobias DiPasquale @ 2004-09-28 14:24 UTC (permalink / raw)
  To: Henrik Nordstrom
  Cc: Harald Welte, netfilter-devel, hadi, Netfilter-failover list,
	KOVACS Krisztian

On Tue, 28 Sep 2004 15:58:52 +0200 (CEST), Henrik Nordstrom
<hno@marasystems.com> wrote:
> On Tue, 28 Sep 2004, KOVACS Krisztian wrote:
> 
> > There are other solutions for that problem, for example Harald's
> > ClusterIP code. If we could integrate that with ct_sync we would be able
> > to do multi-master packet filter clusters without any load balancers
> > before the cluster. If the NAT core would be integrated with ClusterIP's
> > hash to avoid conntrack clashes we could do this without statically
> > assigning different NAT addresses to each node.
> 
> Any ideas on how would this work?
> 
> Lets reason around the common MASQUERADE case where an internal network
> needs to be SNAT:ed when going out to the Internet.

Forgive me for bringing this back up, but...

I believe that Saru handles this problem by assigning "blocks" (a
block being a fixed-sized range of units, e.g. 512 source ports in
sequence) of IPs and ports to various nodes in the cluster and each
node only handles the IP/ports in its assigned blocks. The lookup is
just a bitop so it's fast and this would handle the MASQUERADE case
mentioned above nicely. The blocks are handed out by a userspace
daemon as nodes enter and leave the cluster.
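
To illustrate the kind of lookup meant here, a rough sketch follows. It
is not Saru's code; the 512-port block size is taken from the example
above, and the NodeBlocks class and the way blocks are granted are
assumptions made only for illustration.

```python
# Sketch of block-based ownership: the port space is cut into fixed-size
# blocks and each node holds a bitmap of the blocks it was granted.
BLOCK_BITS = 9                      # 2**9 = 512 ports per block
NUM_BLOCKS = 65536 >> BLOCK_BITS    # 128 blocks cover the whole port space


def block_of(port):
    """Map a source port to its block index with a single shift."""
    return port >> BLOCK_BITS


class NodeBlocks:
    def __init__(self):
        self.bitmap = 0  # bit i set means block i is owned by this node

    def grant(self, block):
        self.bitmap |= 1 << block

    def revoke(self, block):
        self.bitmap &= ~(1 << block)

    def owns(self, port):
        return bool((self.bitmap >> block_of(port)) & 1)


# A (hypothetical) userspace daemon hands out the lower half to node A
# and the upper half to node B.
node_a, node_b = NodeBlocks(), NodeBlocks()
for blk in range(NUM_BLOCKS // 2):
    node_a.grant(blk)
for blk in range(NUM_BLOCKS // 2, NUM_BLOCKS):
    node_b.grant(blk)

print(block_of(31285), node_a.owns(31285), node_b.owns(31285))
```

When a node leaves, its blocks would be revoked and granted to the
survivors; the per-packet check stays a shift and a mask.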

Would this not work?

-- 
[ Tobias DiPasquale ]
0x636f6465736c696e67657240676d61696c2e636f6d

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
       [not found]                       ` <1096379957.1026.5.camel@jzny.localdomain>
@ 2004-09-28 14:46                         ` Henrik Nordstrom
  2004-09-28 14:56                           ` KOVACS Krisztian
  0 siblings, 1 reply; 55+ messages in thread
From: Henrik Nordstrom @ 2004-09-28 14:46 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Harald Welte, netfilter-devel, Netfilter-failover list,
	KOVACS Krisztian

On Tue, 28 Sep 2004, Jamal Hadi Salim wrote:

> Ok, I think this is what Henrik may be saying as well:
> For each of the active nodes, you give it a range of ports to use.
> Maybe 1K ports each to start and then the ct_sync allows for nodes to
> ask for more. A port marked as "in use" by other nodes should not be
> used for simplicity. A node should also gc ports to the general pool.
> Thoughts?

Any controlled division of the tuple address space would solve the 
NAT problem.

I would use IP addresses in this scheme.. it is very nice to have NAT as 
non-intrusive as possible preserving what can be preserved of the original 
tuple.

There remains some delicate thinking on how to manage the traffic flows in 
a sane manner to make sure the correct traffic is forwarded by the correct 
node, considering failovers, recoveries etc.

Regards
Henrik

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-28 14:46                         ` Henrik Nordstrom
@ 2004-09-28 14:56                           ` KOVACS Krisztian
  2004-09-28 15:07                             ` Henrik Nordstrom
  0 siblings, 1 reply; 55+ messages in thread
From: KOVACS Krisztian @ 2004-09-28 14:56 UTC (permalink / raw)
  To: Henrik Nordstrom
  Cc: Harald Welte, Jamal Hadi Salim, netfilter-devel,
	Netfilter-failover list

  Hi,

On Tue, 2004-09-28 at 16:46, Henrik Nordstrom wrote:
> Any controlled division of the tuple address space would solve the 
> NAT problem.
> 
> I would use IP addresses in this scheme.. it is very nice to have NAT as 
> non-intrusive as possible preserving what can be preserved of the original 
> tuple.

  Definitely. However, I don't see how it would be possible to use
MASQUERADE and a single public IP in this case. If you use the complete
reply tuple and some hash function to avoid two nodes using the same
reply tuple that would be a bit more capable. (Similar to what Jamal is
saying: the unique tuple allocation code would take care of allocating a
tuple whose hash value the node "owns". This part would be really
similar to ClusterIP.)
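
  A rough sketch of what such an allocation could look like, assuming a
CRC32 over the reply tuple as the hash and "hash modulo node count" as
the ownership rule. Neither is what ClusterIP or the NAT core actually
uses; all names are hypothetical.

```python
import zlib

NUM_NODES = 2  # assumed cluster size


def tuple_hash(reply_tuple):
    # Stand-in hash; a real implementation would use the kernel's own hash.
    return zlib.crc32(repr(reply_tuple).encode())


def owner(reply_tuple):
    """Node id that "owns" this reply tuple."""
    return tuple_hash(reply_tuple) % NUM_NODES


def alloc_reply_tuple(node_id, ext_ip, want_port, dst_ip, dst_port, used):
    """Find a free reply tuple whose hash this node owns.

    Starts at the original port so the tuple is preserved when possible,
    then walks the unprivileged port range.
    """
    span = 65536 - 1024
    for off in range(span):
        port = 1024 + (want_port - 1024 + off) % span
        candidate = (ext_ip, port, dst_ip, dst_port)
        if candidate not in used and owner(candidate) == node_id:
            used.add(candidate)
            return candidate
    return None  # no owned tuple left for this destination


# Two nodes with independent tables, same single public IP, same request.
used_g, used_h = set(), set()
t_g = alloc_reply_tuple(0, "192.0.2.1", 31285, "198.51.100.80", 80, used_g)
t_h = alloc_reply_tuple(1, "192.0.2.1", 31285, "198.51.100.80", 80, used_h)
print(t_g, t_h)
```

  Since every candidate tuple is owned by exactly one node, the two
results cannot be equal even though the nodes never talked to each
other.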

> There remains some delicate thinking on how to manage the traffic flows in 
> a sane manner to make sure the correct traffic is forwarded by the correct 
> node, considering failovers, recoveries etc.

  Yes, of course. Full re-sync would be a somewhat more complicated
problem as well. But if we maintain some per-conntrack mark indicating
which node "owns" that entry, then even full re-sync could be
implemented quite easily: each node dumps all entries it is responsible
for. The protocol itself should be extended as well, we would need
per-node sequence numbers and per-node recovery requests.
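
  As a sketch of the idea (not of the current ct_sync protocol): each
entry carries an owner mark, each node numbers its own messages, and a
receiver tracks the last sequence number seen per sender. The message
layout and class names below are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    tuple_: tuple
    owner: int  # id of the node that "owns" this conntrack entry


@dataclass
class Node:
    node_id: int
    table: dict = field(default_factory=dict)      # tuple -> Entry
    seq: int = 0                                   # own send sequence number
    last_seen: dict = field(default_factory=dict)  # sender id -> last seq

    def add_local(self, t):
        self.table[t] = Entry(t, self.node_id)

    def full_dump(self):
        """Full re-sync: dump only the entries this node is responsible for."""
        msgs = []
        for entry in self.table.values():
            if entry.owner == self.node_id:
                self.seq += 1
                msgs.append((self.node_id, self.seq, entry.tuple_))
        return msgs

    def receive(self, msg):
        sender, seq, t = msg
        expected = self.last_seen.get(sender, 0) + 1
        if seq != expected:
            # Gap in this sender's stream: ask only that node to recover.
            return ("recover", sender, expected)
        self.last_seen[sender] = seq
        self.table[t] = Entry(t, sender)
        return None


a, b, c = Node(1), Node(2), Node(3)
a.add_local(("10.0.0.1", 31285, "198.51.100.80", 80))
a.add_local(("10.0.0.3", 40000, "198.51.100.80", 80))
b.add_local(("10.0.0.2", 31285, "198.51.100.80", 80))

# Node c joins and receives the dumps of both existing nodes.
for msg in a.full_dump() + b.full_dump():
    c.receive(msg)
print(len(c.table), c.last_seen)
```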

-- 
 Regards,
   Krisztian KOVACS

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-28 14:56                           ` KOVACS Krisztian
@ 2004-09-28 15:07                             ` Henrik Nordstrom
  2004-09-28 18:04                               ` Sven Schuster
  2004-09-29  2:14                               ` Jamal Hadi Salim
  0 siblings, 2 replies; 55+ messages in thread
From: Henrik Nordstrom @ 2004-09-28 15:07 UTC (permalink / raw)
  To: KOVACS Krisztian
  Cc: Harald Welte, Jamal Hadi Salim, netfilter-devel,
	Netfilter-failover list

On Tue, 28 Sep 2004, KOVACS Krisztian wrote:

>> I would use IP addresses in this scheme.. it is very nice to have NAT as
>> non-intrusive as possible preserving what can be preserved of the original
>> tuple.
>
> Definitely. However, I don't see how it would be possible to use
> MASQUERADE and a single public IP in this case.

If you only have a single IP using division by IP addresses is clearly not 
an option.

>> There remains some delicate thinking on how to manage the traffic flows in
>> a sane manner to make sure the correct traffic is forwarded by the correct
>> node, considering failovers, recoveries etc.
>
> Yes, of course. Full re-sync would be a somewhat more complicated
> problem as well.

The full re-sync is just one more facet of the same problem. If one is 
solved you can solve both. It needs to be known per connection which node 
is currently the master, and the addressing/forwarding scheme needs to 
make sure the node who is master for a connection sees the traffic it 
needs to see.

> But if we maintain some per-conntrack mark indicating
> which node "owns" that entry, then even full re-sync could be
> implemented quite easily: each node dumps all entries it is responsible
> for. The protocol itself should be extended as well, we would need
> per-node sequence numbers and per-node recovery requests.

Indeed.

And the flow question of existing connections is also not such big problem 
as all you need is to look up the connection in conntrack and if not 
master you either drop the packet (multicast forwarding model) or 
request to become master of the connection (unicast forwarding model). 
The multicast forwarding model is a lot easier to implement but does not 
scale as well as all nodes need to see all traffic and drop the traffic 
for which they are not master.

This moves the "clusterIP" decision to after the conntrack lookup but 
before creating new conntrack sessions.
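
One way to read the per-packet decision above, as a sketch only. The
conntrack table is modelled as a dict from tuple to master node id, and
new_conn_owner stands in for whatever ClusterIP-style hash is used; the
handling of new connections in the unicast case is an assumption, not
something stated in this thread.

```python
def handle_packet(node_id, conntrack, pkt_tuple, model, new_conn_owner):
    """Return "forward", "drop" or "request_master" for one packet.

    conntrack      -- dict: tuple -> id of the node that is master
    model          -- "multicast" (all nodes see all traffic) or "unicast"
    new_conn_owner -- callable mapping a tuple to the responsible node id
    """
    master = conntrack.get(pkt_tuple)

    if master is None:
        # No entry yet: the "clusterIP" decision happens here, after the
        # lookup but before a new conntrack entry is created.
        if model == "unicast" or new_conn_owner(pkt_tuple) == node_id:
            conntrack[pkt_tuple] = node_id
            return "forward"
        return "drop"

    if master == node_id:
        return "forward"
    if model == "multicast":
        return "drop"          # the master saw the same packet
    return "request_master"    # unicast: only we saw it, take over


t = ("10.0.0.1", 31285, "198.51.100.80", 80)
ct = {}
print(handle_packet(0, ct, t, "multicast", lambda tup: 0))
print(handle_packet(1, {t: 0}, t, "unicast", lambda tup: 0))
```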

Regards
Henrik

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-28 15:07                             ` Henrik Nordstrom
@ 2004-09-28 18:04                               ` Sven Schuster
  2004-09-28 18:47                                 ` Henrik Nordstrom
  2004-09-29  2:14                               ` Jamal Hadi Salim
  1 sibling, 1 reply; 55+ messages in thread
From: Sven Schuster @ 2004-09-28 18:04 UTC (permalink / raw)
  To: Henrik Nordstrom
  Cc: Harald Welte, Jamal Hadi Salim, netfilter-devel,
	Netfilter-failover list, KOVACS Krisztian

[-- Attachment #1: Type: text/plain, Size: 2103 bytes --]

Hi Henrik, hi list members,

On Tue, Sep 28, 2004 at 05:07:50PM +0200, Henrik Nordstrom told us:
> And the flow question of existing connections is also not such big problem 
> as all you need is to look up the connection in conntrack and if not 
> master you either drop the packet (multicast forwarding model) or 
> request to become master of the connection (unicast forwarding model). 
> The multicast forwarding model is a lot easier to implement but does not 
> scale as well as all nodes need to see all traffic and drop the traffic 
> for which they are not master.

just jumping in here:

Shouldn't a node be able to handle a connection for which it isn't the master?? Say,
if we have just one public IP address, we must divide our source ports for NAT into,
say N parts for N firewall cluster nodes. Now, shouldn't the division of the ports
just be to avoid ambiguous tuples on the cluster at connection (NAT) setup time?? If
a node got an update for a tuple, then it should be able to handle this connection
fully, no??

Ok, now (2 seconds ago :) it just made 'click' in my brain. This of course only
applies when it is guaranteed that one (reply) packet is seen by only one node in
our firewall cluster.  Of course, when we have some kind of multicasting so that
each node will see all the traffic, this won't work and we have to make sure that
only one node (preferably the master of this connection) is handling the packet.
But, as you wrote, multicasting does not scale as well as unicast. And when using
multicast, we also would have to make sure that the initial packet which causes
a new connection to be set up is handled by only one node...how??

just my foggy thoughts and some interesting open questions to think about :-)

Sven

> 
> This moves the "clusterIP" decision to after the conntrack lookup but 
> before creating new conntrack sessions.
> 
> Regards
> Henrik
> 

-- 
Linux zion 2.6.9-rc1-mm4 #1 Tue Sep 7 12:57:19 CEST 2004 i686 athlon i386 GNU/Linux
 19:46:21  up 2 days, 18:59,  2 users,  load average: 0.03, 0.02, 0.02

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-28 18:04                               ` Sven Schuster
@ 2004-09-28 18:47                                 ` Henrik Nordstrom
  2004-09-28 20:57                                   ` Sven Schuster
  0 siblings, 1 reply; 55+ messages in thread
From: Henrik Nordstrom @ 2004-09-28 18:47 UTC (permalink / raw)
  To: Sven Schuster
  Cc: Harald Welte, Jamal Hadi Salim, netfilter-devel,
	Netfilter-failover list, KOVACS Krisztian

On Tue, 28 Sep 2004, Sven Schuster wrote:

> Shouldn't a node be able to handle a connection for which it isn't the master??

No.

> Say, if we have just one public IP address, we must divide our source 
> ports for NAT into, say N parts for N firewall cluster nodes. Now, 
> shouldn't the division of the ports just be to avoid ambiguous tuples on 
> the cluster at connection (NAT) setup time?? If a node got an update for 
> a tuple, then it should be able to handle this connection fully, no??

You are mixing up NAT assignment and being master of a connection. The two 
are separate.

> Ok, now (2 seconds ago :) it just made 'click' in my brain. This of course only
> applies when it is guaranteed that one (reply) packet is seen by only one node in
> our firewall cluster.

Which is not the case. In a multicast style setup all firewalls see all 
traffic. Each firewall needs to be able to determine what of this traffic 
it is supposed to forward.

> Of course, when we have some kind of multicasting so that
> each node will see all the traffic, this won't work and we have to make sure that
> only one node (preferably the master of this connection) is handling the packet.
> But, as you wrote, multicasting does not scale as well as unicast.

Correct.

> And when using multicast, we also would have to make sure that the 
> initial packet which causes a new connection to be set up is handled by 
> only one node...how??

By hashing of the original tuple, dividing the address space among the 
currently active firewalls.


The main problem for unicast active-active firewalls is how to handle load 
distribution, failover and recovery in a sane manner without requiring a 
load balancer in front of the firewalls.

Regards
Henrik

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-28 18:47                                 ` Henrik Nordstrom
@ 2004-09-28 20:57                                   ` Sven Schuster
  2004-09-28 22:30                                     ` Tobias DiPasquale
  0 siblings, 1 reply; 55+ messages in thread
From: Sven Schuster @ 2004-09-28 20:57 UTC (permalink / raw)
  To: Henrik Nordstrom
  Cc: Harald Welte, Jamal Hadi Salim, netfilter-devel,
	Netfilter-failover list, KOVACS Krisztian

[-- Attachment #1: Type: text/plain, Size: 1625 bytes --]

Hi Henrik,

On Tue, Sep 28, 2004 at 08:47:15PM +0200, Henrik Nordstrom told us:
> >And when using multicast, we also would have to make sure that the 
> >initial packet which causes a new connection to be set up is handled by 
> >only one node...how??
> 
> By hashing of the original tuple, dividing the address space among the 
> currently active firewalls.
> 
> 
> The main problem for unicast active-active firewalls is how to handle load 
> distribution, failover and recovery in a sane manner without requiring a 
> load balancer in front of the firewalls.

I just thought about this all some more the last few hours and came to
kind of like the same conclusions. While unicast clustering surely
might be more scalable because each node just sees the traffic it is
responsible for, the problem is that the whole traffic must be split
before the firewall cluster. So one would need another special high tech
don't-know-what switch or something like that, so another possible point
of failure...with multicasting, this won't be a problem, as the traffic
will be automagically distributed among the nodes. And if one of the nodes
goes down for some reason, the other nodes will also take over this
node's traffic as soon as they realize this node's failure and so one of
the hashing function's parameters is adjusted to represent the new
situation...!?
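
A small sketch of what "adjusting the hashing function's parameters"
could mean in practice, under the assumption that tuples are hashed
into a fixed number of buckets and nodes own buckets. Bucket count,
hash and function names are made up for the example.

```python
import zlib

NUM_BUCKETS = 8  # assumed fixed number of hash buckets


def bucket(original_tuple):
    """Hash the original tuple into a bucket; identical on every node."""
    return zlib.crc32(repr(original_tuple).encode()) % NUM_BUCKETS


def initial_assignment(nodes):
    """Spread the buckets round-robin over the active nodes."""
    nodes = sorted(nodes)
    return {b: nodes[b % len(nodes)] for b in range(NUM_BUCKETS)}


def take_over(assignment, dead, alive):
    """Move only the dead node's buckets; other flows keep their node."""
    alive = sorted(alive)
    new = dict(assignment)
    orphans = [b for b, n in sorted(assignment.items()) if n == dead]
    for i, b in enumerate(orphans):
        new[b] = alive[i % len(alive)]
    return new


assignment = initial_assignment([0, 1, 2])
after_failure = take_over(assignment, dead=1, alive=[0, 2])
print(assignment)
print(after_failure)
```

Every node computes the same bucket for a packet it sees, and forwards
it only if the bucket is currently assigned to itself.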

Thanks for enlightening me, Henrik :-)

Sven

> 
> Regards
> Henrik

-- 
Linux zion 2.6.9-rc1-mm4 #1 Tue Sep 7 12:57:19 CEST 2004 i686 athlon i386 GNU/Linux
 22:41:48  up 2 days, 21:55,  2 users,  load average: 0.10, 0.04, 0.01

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-28 20:57                                   ` Sven Schuster
@ 2004-09-28 22:30                                     ` Tobias DiPasquale
  2004-09-28 23:36                                       ` Henrik Nordstrom
  0 siblings, 1 reply; 55+ messages in thread
From: Tobias DiPasquale @ 2004-09-28 22:30 UTC (permalink / raw)
  To: Sven Schuster
  Cc: Henrik Nordstrom, KOVACS Krisztian, Harald Welte, netfilter-devel,
	Netfilter-failover list, Jamal Hadi Salim

On Tue, 28 Sep 2004 22:57:23 +0200, Sven Schuster <schuster.sven@gmx.de> wrote:
> And if one of the nodes
> goes down for some reason, the other nodes will also take over this
> node's traffic as soon as they realize this node's failure and so one of
> the hashing function's parameters is adjusted to represent the new
> situation...!?

This is exactly what Saru does.

Also, I'm not exactly sure how you can do active-active without either
multicast or multicast-by-way-of-MAC-address-cloning. This is because
of the fact that I'm not really sure how you can do active-active
without all of the nodes having the same IP address. (otherwise, what
would it be?)

Henrik?

-- 
[ Tobias DiPasquale ]
0x636f6465736c696e67657240676d61696c2e636f6d

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-28 22:30                                     ` Tobias DiPasquale
@ 2004-09-28 23:36                                       ` Henrik Nordstrom
  2004-09-29  3:00                                         ` Tobias DiPasquale
  0 siblings, 1 reply; 55+ messages in thread
From: Henrik Nordstrom @ 2004-09-28 23:36 UTC (permalink / raw)
  To: Tobias DiPasquale
  Cc: KOVACS Krisztian, Harald Welte, netfilter-devel, Sven Schuster,
	Netfilter-failover list, Jamal Hadi Salim

On Tue, 28 Sep 2004, Tobias DiPasquale wrote:

> Also, I'm not exactly sure how you can do active-active without either
> multicast or multicast-by-way-of-MAC-address-cloning. This is because
> of the fact that I'm not really sure how you can do active-active
> without all of the nodes having the same IP address. (otherwise, what
> would it be?)

There are many ways to get traffic onto a node, each having its own 
complications and limitations.

For unicast you need to find some method to divide the address space using 
existing metrics supported by your network. If we assume the trivial setup 
of an HA active-active server then this can for example be done by routing 
the traffic based on the source IP or by client based balancing by 
publishing the service using multiple IPs. In a forwarding NAT firewall it 
gets a little more complex as you have two sides with slightly different 
metrics to care about.  And N node setups get even more complex as you 
then need to be able to divide and merge contexts across several nodes 
when adding/removing nodes unless you accept an imbalance in the load if a 
node member has failed (fixed contexts, where one node at a time is master 
per context).

If you do not have any reasonable support for balancing in your network 
layer then you can either introduce a load balancer layer or use multicast 
(broadcast MAC, or by MAC cloning if supported by the network equipment.. 
not all switches like MAC cloning)

The multicast approach provides full flexibility in how connections are 
migrated among the nodes, but does not scale very well as all nodes see 
all traffic.

A simple unicast load balancing scheme which does work with most equipment 
is policy routing based on IPs. Ports are also doable in most equipment but 
limit the protocols which can be supported.

Statically divide the "clients" in N groups, each group assigned to one 
virtual firewall with a virtual MAC address.

Fail over this virtual firewall among the nodes in the cluster as needed. 
There may only be one master at a time per virtual firewall but each node 
can be master for several virtual firewalls.

Connections need to be ct_sync:ed from the current master to all potential 
backups, multiplied by each virtual firewall.

For virtual MAC address support in Linux see the mac_vlan patch. It allows 
you to create any number of virtual ethernet interfaces per real 
interface, each with their own MAC.
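
The virtual firewall scheme above can be sketched roughly as follows.
This only models the bookkeeping (which node is master for which
virtual firewall); the grouping rule, the number of virtual firewalls
and the failover policy are assumptions, and nothing here touches
mac_vlan or ct_sync.

```python
import ipaddress

NUM_VFW = 4  # assumed number of virtual firewalls (client groups)


def vfw_for_client(ip):
    """Static division of clients into groups, here by address modulo N."""
    return int(ipaddress.ip_address(ip)) % NUM_VFW


class Cluster:
    def __init__(self, nodes):
        self.alive = set(nodes)
        # Exactly one master per virtual firewall; a node may hold several.
        self.master = {v: nodes[v % len(nodes)] for v in range(NUM_VFW)}

    def fail(self, node):
        """Fail over every virtual firewall mastered by the failed node."""
        self.alive.discard(node)
        survivors = sorted(self.alive)
        for v in sorted(self.master):
            if self.master[v] == node:
                # Assumed policy: give it to the least loaded survivor.
                load = list(self.master.values())
                self.master[v] = min(survivors, key=lambda n: (load.count(n), n))


cluster = Cluster(["n1", "n2", "n3"])
before = dict(cluster.master)
cluster.fail("n1")
print(before)
print(cluster.master)
print(vfw_for_client("10.0.0.5"))
```

With only N fixed contexts the load after a failure is uneven unless N
is much larger than the number of nodes, which is the imbalance
mentioned earlier.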

Regards
Henrik

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-28 15:07                             ` Henrik Nordstrom
  2004-09-28 18:04                               ` Sven Schuster
@ 2004-09-29  2:14                               ` Jamal Hadi Salim
  2004-09-29  8:12                                 ` Henrik Nordstrom
  1 sibling, 1 reply; 55+ messages in thread
From: Jamal Hadi Salim @ 2004-09-29  2:14 UTC (permalink / raw)
  To: Henrik Nordstrom
  Cc: Harald Welte, netfilter-devel, Netfilter-failover list,
	KOVACS Krisztian

On Tue, 2004-09-28 at 11:07, Henrik Nordstrom wrote:
> On Tue, 28 Sep 2004, KOVACS Krisztian wrote:

> 
> If you only have a single IP using division by IP addresses is clearly not 
> an option.

Like you said earlier, the allocation is IPaddress:portrange. I do think
this will require some extra hack for the mapping. Am I mistaken?
So instead if you just zeroed out the IPaddress piece, then the only
change left is the port range.
I believe this is already supported in the form of
/proc/sys/net/ipv4/ip_local_port_range
So even if the protocol is not ready, make this a static config
in each node. Ensure unique ranges on each node whose allocation is
synchronized by a human to start with.
Unfortunately I am not sure if you can force current to always get
its port allocation from the allocated range _only_. Is this doable?
Caveat: You just limited your active connections for your cluster to 64K
flows. Clearly the ability to get an IPaddress:port range mapping will fix
that; I am just thinking first steps.
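
To make the static split concrete, here is a small sketch of dividing
one port range into disjoint per-node ranges. The function name and the
example bounds are assumptions; how such a range would be enforced in
the NAT core is exactly the open question above and is not shown.

```python
def split_port_range(lo, hi, num_nodes):
    """Split the inclusive range [lo, hi] into num_nodes disjoint ranges."""
    total = hi - lo + 1
    base, extra = divmod(total, num_nodes)
    ranges, start = [], lo
    for i in range(num_nodes):
        size = base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((start, start + size - 1))
        start += size
    return ranges


ranges = split_port_range(1024, 65535, 3)
for node_id, (first, last) in enumerate(ranges):
    print(f"node {node_id}: {first}-{last}")

# The split adds no capacity: the ranges sum to the original pool.
pool_size = sum(last - first + 1 for first, last in ranges)
print(pool_size)
```

Each node would be configured with its own line of the output; the
total stays the size of the single shared pool, which is the caveat
mentioned above.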

> The full re-sync is just one more facet of the same problem. If one is 
> solved you can solve both. It needs to be known per connection which node 
> is currently the master, and the addressing/forwarding scheme needs to 
> make sure the node who is master for a connection sees the traffic it 
> needs to see.
> 
> > But if we maintain some per-conntrack mark indicating
> > which node "owns" that entry, then even full re-sync could be
> > implemented quite easily: each node dumps all entries it is responsible
> > for. The protocol itself should be extended as well, we would need
> > per-node sequence numbers and per-node recovery requests.
> 
> Indeed.
> 
> And the flow question of existing connections is also not such big problem 
> as all you need is to look up the connection in conntrack and if not 
> master you either drop the packet (multicast forwarding model) or 
> request to become master of the connection (unicast forwarding model). 
> The multicast forwarding model is a lot easier to implement but does not 
> scale as well as all nodes need to see all traffic and drop the traffic 
> for which they are not master.
> 
> This moves the "clusterIP" decision to after the conntrack lookup but 
> before creating new conntrack sessions.

I think we should make the issue of balancing a separate item.
It shouldnt matter how the packet gets delivered to an active node
(pigeons, expensive loadbalancers, LVS, some routing tricks,
 etc all is fine by me). 
Assuming:
- The state is already synced across all the nodes;
- assuming theres no conflict and 
- assuming asymetry (ok, maybe thats too restrictive) but a good start. 

Then:
any node should be able to masquerade any packet. And any packet coming
from the external realm should get to its origin just fine.

cheers,
jamal


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-28 23:36                                       ` Henrik Nordstrom
@ 2004-09-29  3:00                                         ` Tobias DiPasquale
  2004-09-29  8:34                                           ` Henrik Nordstrom
  0 siblings, 1 reply; 55+ messages in thread
From: Tobias DiPasquale @ 2004-09-29  3:00 UTC (permalink / raw)
  To: Henrik Nordstrom
  Cc: KOVACS Krisztian, Harald Welte, netfilter-devel, Sven Schuster,
	Netfilter-failover list, Jamal Hadi Salim

On Wed, 29 Sep 2004 01:36:19 +0200 (CEST), Henrik Nordstrom
<hno@marasystems.com> wrote:
> There are many ways to get traffic onto a node, each having its own
> complications and limitations.
> 
> For unicast you need to find some method to divide the address space using
> existing metrics supported by your network. If we assume the trivial setup
> of an HA active-active server then this can for example be done by routing
> the traffic based on the source IP or by client based balancing by
> publishing the service using multiple IPs. In a forwarding NAT firewall it
> gets a little more complex as you have two sides with slightly different
> metrics to care about.  And N-node setups get even more complex as you
> then need to be able to divide and merge contexts across several nodes
> when adding/removing nodes unless you accept an imbalance in the load if a
> node member has failed (fixed contexts, where one node at a time is master
> per context).

The idea of what I was saying is still the same: active-active implies
making multiple machines answer for a single address, whether it's a
firewall, load-balancer, DNS server, HTTP server or anything else.
Advertising the service on multiple IPs just means you have x * y
machines, where x == number of advertised addresses and y == number of
machines that actively service each individual advertised address.

> If you do not have any reasonable support for balancing in your network
> layer then you can either introduce a load balancer layer or use multicast
> (broadcast MAC, or by MAC cloning if supported by the network equipment..
> not all switches like MAC cloning)

I thought that having all of the active nodes respond to ARP requests
for the virtual IP would handle that, as the switch could no longer
bind an IP address to a particular switch port? Is that not the case
with some hardware?

> The multicast approach provides full flexibility in how connections are
> migrated among the nodes, but does not scale very well as all nodes see
> all traffic.

Yeah, that's for sure.

> A simple unicast load balancing method which does work with most equipment is
> policy routing based on IPs. Port-based routing is also doable in most
> equipment but limits the protocols which can be supported.
> 
> Statically divide the "clients" in N groups, each group assigned to one
> virtual firewall with a virtual MAC address.
> 
> Fail over this virtual firewall among the nodes in the cluster as needed.
> There may only be one master at a time per virtual firewall but each node
> can be master for several virtual firewalls.

Right, but same situation as above. There are still multiple machines
acting as one by way of a virtual address.

> connections need to be ct_sync:ed from the current master to all potential
> backups, multiplied by each virtual firewall.

Is that desired? I'm now thinking that perhaps the goals of ct_sync
are not in line with in-cluster active-active load-balancing.

Here's what I mean: if you have a 4-node cluster of firewalls in
active-active configuration, then the states of all connections
flowing through ALL the boxes are replicated on all of the boxes. This
defeats the purpose of active-active load-balancing as each of the
boxes would then handle almost all of the load of the whole cluster. I
don't see why I'd want to synchronize the connection states between
_all_ of the machines in an active-active cluster.

> For virtual MAC address support to Linux see the mac_vlan patch. Allows
> you to create any number of virtual ethernet interfaces per real
> interface, each with their own MAC.

Nice, that sounds better than the hidden patch I was looking at. Thanks :)

-- 
[ Tobias DiPasquale ]
0x636f6465736c696e67657240676d61696c2e636f6d


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-29  2:14                               ` Jamal Hadi Salim
@ 2004-09-29  8:12                                 ` Henrik Nordstrom
  2004-09-29 11:13                                   ` Jamal Hadi Salim
  0 siblings, 1 reply; 55+ messages in thread
From: Henrik Nordstrom @ 2004-09-29  8:12 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Harald Welte, netfilter-devel, Netfilter-failover list,
	KOVACS Krisztian

On Wed, 28 Sep 2004, Jamal Hadi Salim wrote:

> Like you said earlier, the allocation is IPaddress:portrange. I do think
> this will require some extra hack for the mapping. Am I mistaken?
> So instead if you just zeroed out the IPaddress piece, then the only
> change left is the port range.
> I believe this is already supported in the form of
> /proc/sys/net/ipv4/ip_local_port_range

This is for local connections, not related to NAT.

> Unfortunately i am not sure if you can force current to always get
> its port allocation from the allocated range _only_. Is this doable?

iptables NAT allows you to specify the IP and port range acceptable.

> Caveat: You just limited your active connections for your cluster to 64K
> flows.

Not really. But if you only have a single IP address then the port 
allocation limits you to 64K flows per "other" IP. As already discussed 
the tuple needs to be unique between different flows, not necessarily the 
port.

> I think we should make the issue of balancing a separate item.

It is separate from the matter of synchronization. But the reason why the 
synchronization protocol has not yet been designed for active-active is 
because no load balancing scheme has been designed which would work. The 
two go hand in hand and both need to be solved.

Most likely the first load balancing method which will get implemented 
(and forcing ct_sync to add the two minor pieces missing for active-active 
synchronization) is the multicast balancing method in a no-NAT cluster where 
each firewall sees all traffic and selects what it looks closer at. This 
is by far the easiest to implement. But even this is not entirely trivial 
as there may be conflicts in flow key balance IDs depending on the 
direction of the flow, but most likely this problem is more theoretical 
than practical.
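One possible reading of a "flow key balance ID", sketched here purely as an assumption since the thread does not define it: every node hashes the flow tuple to a bucket and only looks closer at packets whose bucket it owns. For both directions of a flow to land in the same bucket, the key has to be independent of direction, which in a no-NAT cluster can be done by ordering the two endpoints before hashing. The function name and the CRC32 hash are illustrative choices only.

```python
import zlib


def balance_id(src_ip, src_port, dst_ip, dst_port, buckets):
    """Return a direction-independent bucket number for a flow.

    The two endpoints are sorted before hashing, so the original and the
    reply direction of a (non-NATed) flow produce the same key.
    """
    low, high = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{low[0]}:{low[1]}-{high[0]}:{high[1]}".encode()
    return zlib.crc32(key) % buckets


# Original direction and reply direction of the same flow.
forward = balance_id("10.0.0.5", 40000, "192.0.2.10", 80, buckets=4)
reverse = balance_id("192.0.2.10", 80, "10.0.0.5", 40000, buckets=4)
```

With NAT the reply tuple carries the translated address and port, so such a key would no longer match in both directions without extra work, which fits the remark that the no-NAT case is the easiest one to start with.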

> It shouldn't matter how the packet gets delivered to an active node
> (pigeons, expensive loadbalancers, LVS, some routing tricks,
> etc.; all are fine by me).
> Assuming:
>
> - the state is already synced across all the nodes;
>
> - there's no conflict; and
>
> - asymmetry (OK, maybe that's too restrictive, but it's a good start).

Allowing for asymmetric flows is not something which is realistic to aim 
for unless you also aim for absolute synchronization by delaying packets 
until the firewalls have been synchronized. Such a design will scale very 
badly.

Regards
Henrik


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-29  3:00                                         ` Tobias DiPasquale
@ 2004-09-29  8:34                                           ` Henrik Nordstrom
  0 siblings, 0 replies; 55+ messages in thread
From: Henrik Nordstrom @ 2004-09-29  8:34 UTC (permalink / raw)
  To: Tobias DiPasquale
  Cc: KOVACS Krisztian, Harald Welte, netfilter-devel, Sven Schuster,
	Netfilter-failover list, Jamal Hadi Salim

On Tue, 28 Sep 2004, Tobias DiPasquale wrote:

> I thought that having all of the active nodes respond to ARP requests
> for the virtual IP would handle that, as the switch could no longer
> bind an IP address to a particular switch port? Is that not the case
> with some hardware?

Switches don't work with IP addresses, they work with MAC addresses.

There are some odd switches which only allow one port per MAC, cutting the 
traffic on the old port when seeing the MAC on a new port. But we are not 
likely to see these switches in environments where this type of firewall 
is intended to be deployed.

>> Fail over this virtual firewall among the nodes in the cluster as needed.
>> There may only be one master at a time per virtual firewall but each node
>> can be master for several virtual firewalls.
>
> Right, but same situation as above. There are still multiple machines
> acting as one by way of a virtual address.

Well, to be honest I am not at all considering load balancing of a host, 
only load balancing of firewalls. A firewall does not really have an 
address in that sense; it has a ruleset on what is allowed to be forwarded 
and how the traffic should be mangled (NAT:ed) while it is forwarded. 
Traffic is not addressed TO the firewall, it is addressed to something on 
the other side of the firewall.  The load balancing of hosts is already 
done very well by LVS and is not really a problem which iptables/netfilter 
ct_sync needs to address. Because of this I do not restrict my thinking to 
load balancing methods which would work for a single host/service.

The network simply needs to know which firewall to send the traffic to 
depending on the type of flow. Depending on how smart your network is this 
places certain limitations on the type of load balancing methods you can 
select and what restrictions there are on the firewall/NAT rules.

>> connections need to be ct_sync:ed from the current master to all potential
>> backups, multiplied by each virtual firewall.
>
> Is that desired? I'm now thinking that perhaps the goals of ct_sync
> are not in line with in-cluster active-active load-balancing.

Let me rephrase the above:

Connections need to be ct_sync:ed from the current master to all potential 
backups of this virtual firewall. The potential backup nodes can be a 
subset of the total nodes of the cluster, at least one. This is multiplied 
by each virtual firewall as each virtual firewall needs to have its 
connections synchronized from its current master to its potential 
backups.

The synchronization needs to be online, allowing a backup to recover the 
traffic in case the current master crashes and its connection table is 
lost.

> Here's what I mean: if you have a 4-node cluster of firewalls in
> active-active configuration, then the states of all connections
> flowing through ALL the boxes are replicated on all of the boxes.

Not necessarily, this cluster can be divided into 4 virtual firewalls each 
handling a subset of the traffic and each having at least one 
potential backup node assigned. Each virtual firewall has a unique MAC 
shared among the potential nodes of this virtual firewall.

The drawback is that if one node fails then the load will be doubled on 
its backup node as that node then gets two virtual firewalls. But the 
benefit is that this design allows for relatively easy unicast 
partitioning of the traffic flows using standard equipment.

If you want more granular load balancing then "simply" divide the setup in 
more virtual firewalls just as you would do if there were more nodes in the 
cluster. With 4 * 3 virtual firewalls you can get very good load 
distribution even in case of one, two or three node failures. The drawback 
is that the network setup gets more complex.
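To make the 4 * 3 arithmetic concrete, here is a small sketch of one way such an assignment could look. It is only an illustration under assumed names: each of the 4 nodes is primary for 3 virtual firewalls, and each of those 3 has a different node as first backup. Nothing here is part of ct_sync.

```python
from collections import Counter


def build_preferences(nodes):
    """One virtual firewall per (primary, first backup) pair of distinct nodes."""
    prefs = {}
    for primary in nodes:
        for backup in nodes:
            if backup == primary:
                continue
            rest = [n for n in nodes if n not in (primary, backup)]
            prefs[f"vf-{primary}{backup}"] = [primary, backup] + rest
    return prefs


def masters(prefs, alive):
    """Each virtual firewall is mastered by the first alive node in its list."""
    return {vf: next(n for n in order if n in alive) for vf, order in prefs.items()}


def load(prefs, alive):
    """Number of virtual firewalls mastered by each alive node."""
    return Counter(masters(prefs, alive).values())


nodes = ["A", "B", "C", "D"]
prefs = build_preferences(nodes)          # 12 virtual firewalls
healthy = load(prefs, set(nodes))         # 3 virtual firewalls per node
after_a_fails = load(prefs, {"B", "C", "D"})  # A's 3 are spread over B, C, D
```

With only 4 virtual firewalls, losing node A would leave one backup with two of them, i.e. double load. With 12, each survivor goes from 3 to 4 virtual firewalls, at the cost of 12 virtual MACs and the routing rules that go with them.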

> This defeats the purpose of active-active load-balancing as each of the 
> boxes would then handle almost all of the load of the whole cluster. I 
> don't see why I'd want to synchronize the connection states between 
> _all_ of the machines in an active-active cluster.

Which is exactly what I am saying. Just different words.

Regards
Henrik


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-29  8:12                                 ` Henrik Nordstrom
@ 2004-09-29 11:13                                   ` Jamal Hadi Salim
  2004-09-29 11:29                                     ` KOVACS Krisztian
  2004-09-29 11:44                                     ` Henrik Nordstrom
  0 siblings, 2 replies; 55+ messages in thread
From: Jamal Hadi Salim @ 2004-09-29 11:13 UTC (permalink / raw)
  To: Henrik Nordstrom
  Cc: Harald Welte, netfilter-devel, Netfilter-failover list,
	KOVACS Krisztian

On Wed, 2004-09-29 at 04:12, Henrik Nordstrom wrote:
> On Wed, 28 Sep 2004, Jamal Hadi Salim wrote:

> > Unfortunately i am not sure if you can force current to always get
> > its port allocation from the allocated range _only_. Is this doable?
> 
> iptables NAT allows you to specify the IP and port range acceptable.
> 

My brain access is getting slow - I knew this. But does the machine then
only restrict itself to those ports for that IP defined?

> > Caveat: You just limited your active connections for your cluster to 64K
> > flows.
> 
> Not really. But if you only have a single IP address then the port 
> allocation limits you to 64K flows per "other" IP. As already discussed 
> the tuple needs to be unique between different flows, not necessarily the 
> port.

To clarify what you are saying: you will have (in the case of
masquerading a single external realm IP) 64K possible flows per internal
realm _src_ IP?

> > I think we should make the issue of balancing a separate item.
> 
> It is separate from the matter of synchronization. But the reason why the 
> synchronization protocol has not yet been designed for active-active is 
> because no load balancing scheme has been designed which would work. The 
> two go hand in hand and both need to be solved.

The cluster shouldn't care how the packet got there. Put LVS in front of
the cluster or play ARP tricks - it doesn't matter.
Maybe I am missing something.

> Most likely the first load balancing method which will get implemented 
> (and forcing ct_sync to add the two minor pieces missing for active-active 
> synchronization) is the multicast balancing method in a no-NAT cluster where 
> each firewall sees all traffic and selects what it looks closer at.

I am not saying this shouldn't be supported, but why the restriction
to _only_ this? 

>  This 
> is by far the easiest to implement. But even this is not entirely trivial 
> as there may be conflicts in flow key balance IDs depending on the 
> direction of the flow, but most likely this problem is more theoretical 
> than practical.
> 

That's what it seems to me. 
Maybe the piece I am missing is this "flow key balance IDs"...
If I have all the state of the peer firewall, why should I not be able
to process packets you throw at me?

> > It shouldn't matter how the packet gets delivered to an active node
> > (pigeons, expensive loadbalancers, LVS, some routing tricks,
> > etc.; all are fine by me).
> > Assuming:
> >
> > - the state is already synced across all the nodes;
> >
> > - there's no conflict; and
> >
> > - asymmetry (OK, maybe that's too restrictive, but it's a good start).
> 
> Allowing for asymmetric flows is not something which is realistic to aim 
> for unless you also aim for absolute synchronization by delaying packets 
> until the firewalls have been synchronized. Such a design will scale very 
> badly.

Sorry, I meant a symmetric setup, i.e. equal-looking machines with the same or
a similar set of addresses. Two masquerading examples:

I:
Machine A: internal IP: 10.0.0.1, external 1.1.1.1
Machine B: internal IP: 10.0.0.2, external 1.1.1.2

On failover, all addresses taken over by peer; state synced, things 
run as before.

II:
Machine A: internal IP: 10.0.0.1, external 1.1.1.1
Machine B: internal IP: 10.0.0.2, external 1.1.1.1

cheers,
jamal


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-29 11:13                                   ` Jamal Hadi Salim
@ 2004-09-29 11:29                                     ` KOVACS Krisztian
  2004-09-29 11:44                                     ` Henrik Nordstrom
  1 sibling, 0 replies; 55+ messages in thread
From: KOVACS Krisztian @ 2004-09-29 11:29 UTC (permalink / raw)
  To: hadi
  Cc: Harald Welte, netfilter-devel, Netfilter-failover list,
	Henrik Nordstrom


  Hi,

On Wed, 2004-09-29 at 13:13, Jamal Hadi Salim wrote:
> Sorry, I meant a symmetric setup, i.e. equal-looking machines with the same or
> a similar set of addresses. Two masquerading examples:
> 
> I:
> Machine A: internal IP: 10.0.0.1, external 1.1.1.1
> Machine B: internal IP: 10.0.0.2, external 1.1.1.2
> 
> On failover, all addresses taken over by peer; state synced, things 
> run as before.

  This is (almost) possible with the current code: all we should do is
to extend cts_proto to handle multiple redundancy groups at the same
time. Such a solution plus some simple VRRP-based redundancy would be
able to handle such a setup.

-- 
 Regards,
   Krisztian KOVACS


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-29 11:13                                   ` Jamal Hadi Salim
  2004-09-29 11:29                                     ` KOVACS Krisztian
@ 2004-09-29 11:44                                     ` Henrik Nordstrom
  2004-09-29 13:03                                       ` Jamal Hadi Salim
  1 sibling, 1 reply; 55+ messages in thread
From: Henrik Nordstrom @ 2004-09-29 11:44 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Harald Welte, netfilter-devel, Netfilter-failover list,
	KOVACS Krisztian

On Wed, 29 Sep 2004, Jamal Hadi Salim wrote:

> To clarify what you are saying: you will have (in the case of
> masquerading a single external realm IP) 64K possible flows per internal
> realm _src_ IP?

No, per external IP,port contacted by your internal addresses.

Original tuples

internal_station_ip,port external_server_ip,port

Masqueraded tuples

masquerade_ip,port(*) external_server_ip,port
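The point that the whole tuple, not the port alone, has to be unique can be illustrated with a toy port allocator. This is a sketch only and does not reflect how netfilter NAT actually picks ports; the class name and the tiny port range are made up for the example.

```python
class Masquerader:
    """Toy source NAT: picks a port so that the masqueraded tuple is unique."""

    def __init__(self, masq_ip, ports):
        self.masq_ip = masq_ip
        self.ports = list(ports)
        self.used = set()    # (masq_port, server_ip, server_port) in use
        self.mapping = {}    # original tuple -> masqueraded tuple

    def translate(self, src_ip, src_port, dst_ip, dst_port):
        original = (src_ip, src_port, dst_ip, dst_port)
        if original in self.mapping:
            return self.mapping[original]
        for port in self.ports:
            # A port is only "taken" with respect to one external ip,port.
            if (port, dst_ip, dst_port) not in self.used:
                self.used.add((port, dst_ip, dst_port))
                masq = (self.masq_ip, port, dst_ip, dst_port)
                self.mapping[original] = masq
                return masq
        raise RuntimeError("no free port for this external ip,port")


nat = Masquerader("1.1.1.1", range(1024, 1026))  # only two ports, to show the limit
first = nat.translate("10.0.0.5", 4000, "192.0.2.10", 80)
second = nat.translate("10.0.0.6", 4000, "192.0.2.10", 80)
other_server = nat.translate("10.0.0.7", 4000, "192.0.2.20", 80)
```

Here the two flows to 192.0.2.10:80 use up both ports, yet the flow to 192.0.2.20:80 can reuse port 1024 because its tuple is still unique. A third flow to 192.0.2.10:80 would fail, which is the per external IP,port limit scaled down from 64K to 2.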

> The cluster shouldn't care how the packet got there. Put LVS in front of
> the cluster or play ARP tricks - it doesn't matter.
> Maybe I am missing something.

For a firewall you have to care about traffic on both sides, unless you 
are doing multicast balancing.  Your distribution must be 100% equal on 
both sides.

>> Most likely the first load balancing method which will get implemented
>> (and forcing ct_sync to add the two minor pieces missing for active-active
>> synchronization) is the multicast balancing method in a no-NAT cluster where
>> each firewall sees all traffic and selects what it looks closer at.
>
> I am not saying this shouldn't be supported, but why the restriction
> to _only_ this?

Who has said this? It is certainly not my intention to imply multicast 
load balancing is the only way to go.

>>  This is by far the easiest to implement. But even this is not entirely 
>> trivial as there may be conflicts in flow key balance IDs depending on 
>> the direction of the flow, but most likely this problem is more 
>> theoretical than practical.
>>
>
> That's what it seems to me.
> Maybe the piece I am missing is this "flow key balance IDs"...
> If I have all the state of the peer firewall, why should I not be able
> to process packets you throw at me?

Because you don't have all the state. There are significant delays in 
the state distribution unless you accept making the state synchronization 
synchronous with the packet flow (no packet forwarded before the state 
change this packet implies has been verifiably synchronized to all 
firewalls) which would in effect make the total cluster performance some 
orders of magnitude worse than having a single firewall with no or very 
little options for load balancing.

> Sorry, I meant a symmetric setup, i.e. equal-looking machines with the same or
> a similar set of addresses. Two masquerading examples:
>
> I:
> Machine A: internal IP: 10.0.0.1, external 1.1.1.1
> Machine B: internal IP: 10.0.0.2, external 1.1.1.2
>
> On failover, all addresses taken over by peer; state synced, things
> run as before.
>
> II:
> Machine A: internal IP: 10.0.0.1, external 1.1.1.1
> Machine B: internal IP: 10.0.0.2, external 1.1.1.1

I don't quite follow what you aim for here. Please outline this in terms 
of connections (full tuples), masquerading/NAT (modified tuples) and 
routing/forwarding of traffic in both directions.

Regards
Henrik


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-29 11:44                                     ` Henrik Nordstrom
@ 2004-09-29 13:03                                       ` Jamal Hadi Salim
  2004-09-29 13:41                                         ` Henrik Nordstrom
  0 siblings, 1 reply; 55+ messages in thread
From: Jamal Hadi Salim @ 2004-09-29 13:03 UTC (permalink / raw)
  To: Henrik Nordstrom
  Cc: Harald Welte, netfilter-devel, Netfilter-failover list,
	KOVACS Krisztian

On Wed, 2004-09-29 at 07:44, Henrik Nordstrom wrote:
> On Wed, 29 Sep 2004, Jamal Hadi Salim wrote:
> 
> > To clarify what you are saying: you will have (in the case of
> > masquerading a single external realm IP) 64K possible flows per internal
> > realm _src_ IP?
> 
> No, per external IP,port contacted by your internal addresses.
> 
> Original tuples
> 
> internal_station_ip,port external_server_ip,port
> 
> Masqueraded tuples
> 
> masquerade_ip,port(*) external_server_ip,port

OK, so my math was off. A relief. But if all were going to the same
external IP then the limit is 64K.

> > The cluster shouldn't care how the packet got there. Put LVS in front of
> > the cluster or play ARP tricks - it doesn't matter.
> > Maybe I am missing something.
> 
> For a firewall you have to care about traffic on both sides, unless you 
> are doing multicast balancing.  

I am assuming ct_sync is creating these states for me. 

> Your distribution must be 100% equal on 
> both sides.
> 

Given above assumption, why do i need this?


> Who has said this? It is certainly not my intention to imply multicast 
> load balancing is the only way to go.

That's how I read your message - but if it's but one option, fine.

> >>  This is by far the easiest to implement. But even this is not entirely 
> >> trivial as there may be conflicts in flow key balance IDs depending on 
> >> the direction of the flow, but most likely this problem is more 
> >> theoretical than practical.
> >>
> >
> > That's what it seems to me.
> > Maybe the piece I am missing is this "flow key balance IDs"...
> > If I have all the state of the peer firewall, why should I not be able
> > to process packets you throw at me?
> 
> Because you don't have all the state. There are significant delays in 
> the state distribution unless you accept making the state synchronization 
> synchronous with the packet flow (no packet forwarded before the state 
> change this packet implies has been verifiably synchronized to all 
> firewalls) which would in effect make the total cluster performance some 
> orders of magnitude worse than having a single firewall with no or very 
> little options for load balancing.

I was assuming all along we have agreed that all necessary state would
be synced. Yes, this may generate a lot of traffic, but it does seem to
be a necessary evil.


> > Sorry, I meant a symmetric setup, i.e. equal-looking machines with the same or
> > a similar set of addresses. Two masquerading examples:
> >
> > I:
> > Machine A: internal IP: 10.0.0.1, external 1.1.1.1
> > Machine B: internal IP: 10.0.0.2, external 1.1.1.2
> >
> > On failover, all addresses taken over by peer; state synced, things
> > run as before.
> >
> > II:
> > Machine A: internal IP: 10.0.0.1, external 1.1.1.1
> > Machine B: internal IP: 10.0.0.2, external 1.1.1.1
> 
> I don't quite follow what you aim for here. Please outline this in terms 
> of connections (full tuples), masquerading/NAT (modified tuples) and 
> routing/forwarding of traffic in both directions.

10.0.0.0/24 is the internal network. There are ~250 machines behind the cluster.
The cluster contains those two machines only.
The provider gives us one or two IPs depending on the setup.

cheers,
jamal


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-29 13:03                                       ` Jamal Hadi Salim
@ 2004-09-29 13:41                                         ` Henrik Nordstrom
  2004-09-29 14:23                                           ` jamal
  0 siblings, 1 reply; 55+ messages in thread
From: Henrik Nordstrom @ 2004-09-29 13:41 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Harald Welte, netfilter-devel, Netfilter-failover list,
	KOVACS Krisztian

On Wed, 29 Sep 2004, Jamal Hadi Salim wrote:

> Ok, so my math was off. A relief. But if all were going to the same
> external IP then limit is 64K.

Yes, unless you have more than one masquerade IP or they go to different 
ports.

Basic limitation of TCP/IP. Not much netfilter can do about it. Between two 
IP addresses where one is server (single port) the other client (dynamic 
port) there can be no more than 64K connections as the only piece of the 
tuple identifying such connections is the source port of the client, 
limited to 2^16.

>> For a firewall you have to care about traffic on both sides, unless you
>> are doing multicast balancing.
>
> I am assuming ct_sync is creating these states for me.

ct_sync or not does not really matter in this part of the discussion. ct_sync 
only allows for graceful failover of the session from one firewall to 
another.

If you want to have an active-active firewall using unicast distribution 
then your network (not the firewall) must somehow ensure that traffic in 
both directions goes via the correct firewall with no asymmetric data 
flows. If not, the state synchronization becomes impractical.

Internal Network -> Firewall

Internet -> Firewall

>> Your distribution must be 100% equal on
>> both sides.
>
> Given above assumption, why do i need this?

Because synchronization of the firewalls is not instantaneous or atomic. 
There is a delay before a state change on one firewall is replicated to 
all other firewalls.

>> Because you don't have all the state. There are significant delays in
>> the state distribution unless you accept making the state synchronization
>> synchronous with the packet flow (no packet forwarded before the state
>> change this packet implies has been verifiably synchronized to all
>> firewalls) which would in effect make the total cluster performance some
>> orders of magnitude worse than having a single firewall with no or very
>> little options for load balancing.
>
> I was assuming all along we have agreed that all necessary state would
> be synced. Yes, this may generate a lot of traffic, but it does seem to
> be a necessary evil.

Syncing state in absolute realtime is in my opinion not an option. This 
would require delaying forwarding of every packet until the state change 
inflicted by this packet has been fully replicated to all nodes in the 
cluster, which in terms of performance is some orders of magnitude worse 
than using the multicast approach to have the traffic distributed among 
the nodes.
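The trade-off can be put into a very small model. This is only a sketch with assumed numbers; `sync_ms` stands for the time it takes a state change to reach all other nodes, and both function names are made up for the illustration.

```python
def forwarding_delay_ms(base_ms, sync_ms, synchronous):
    """Per-packet delay: synchronous replication holds the packet until
    the state change has reached all nodes."""
    return base_ms + sync_ms if synchronous else base_ms


def reply_sees_state(round_trip_ms, sync_ms, synchronous, same_node):
    """Does the node receiving the reply already know the connection?"""
    if same_node or synchronous:
        return True
    # Asynchronous replication and an asymmetric path: the reply only
    # matches if replication finished before the reply arrived.
    return sync_ms < round_trip_ms


# Assumed figures: 0.05 ms to forward a packet, 2 ms to replicate a change.
async_delay = forwarding_delay_ms(0.05, 2.0, synchronous=False)
sync_delay = forwarding_delay_ms(0.05, 2.0, synchronous=True)
lan_reply_ok = reply_sees_state(0.5, 2.0, synchronous=False, same_node=False)
wan_reply_ok = reply_sees_state(30.0, 2.0, synchronous=False, same_node=False)
```

Under these assumed figures a fast (LAN) reply arriving on another node finds no state, a slow (WAN) reply does, and making replication synchronous fixes the first case only by adding the replication time to every state-changing packet.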

Regards
Henrik


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-29 13:41                                         ` Henrik Nordstrom
@ 2004-09-29 14:23                                           ` jamal
  2004-09-29 15:02                                             ` Henrik Nordstrom
  0 siblings, 1 reply; 55+ messages in thread
From: jamal @ 2004-09-29 14:23 UTC (permalink / raw)
  To: Henrik Nordstrom
  Cc: Harald Welte, netfilter-devel, Netfilter-failover list,
	KOVACS Krisztian

On Wed, 2004-09-29 at 09:41, Henrik Nordstrom wrote:
> On Wed, 29 Sep 2004, Jamal Hadi Salim wrote:

[..]

> if you want to have an active-active firewall using unicast distribution 
> then your network (not the firewall) must somehow ensure that traffic in 
> both directions goes via the correct firewall with no asymmetric data 
> flows.

Let's assume symmetry.
My goal is to make _all_ firewalls in the cluster the "correct" firewall.

> >> Your distribution must be 100% equal on
> >> both sides.
> >
> > Given above assumption, why do i need this?
> 
> Because synchronization of the firewalls is not instantaneous or atomic. 
> There is a delay before a state change on one firewall is replicated to 
> all other firewalls.

Where this becomes an issue is if your responses come back faster than
you can synchronize state. If you are going across big bad internet
where latencies start in the ms range, then the issue becomes less
challenging. 

> > I was assuming all along we have agreed that all necessary state would
> > be synced. Yes, this may generate a lot of traffic, but it does seem to
> > be a necessary evil.
> 
> Syncing state in absolute realtime is in my opinion not an option. This 
> would require delaying forwarding of every packet until the state change 
> inflicted by this packet has been fully replicated to all nodes in the 
> cluster, which in terms of performance is some orders of magnitude worse 
> than using the multicast approach to have the traffic distributed among 
> the nodes.

I think what you describe above needs to be done in the case of response
latency being lower than update latency, i.e. it's not a bad option. It
will slow down the setup time but that's only for the first new packet.
The impact on subsequent packets in the flow should be low.

cheers,
jamal


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-29 14:23                                           ` jamal
@ 2004-09-29 15:02                                             ` Henrik Nordstrom
  2004-09-30 12:24                                               ` jamal
  0 siblings, 1 reply; 55+ messages in thread
From: Henrik Nordstrom @ 2004-09-29 15:02 UTC (permalink / raw)
  To: jamal
  Cc: Harald Welte, netfilter-devel, Netfilter-failover list,
	KOVACS Krisztian

On Wed, 29 Sep 2004, jamal wrote:

> Where this becomes an issue is if your responses come back faster than
> you can synchronize state. If you are going across big bad internet
> where latencies start in the ms range, then the issue becomes less
> challenging.

Problem is you don't really know.

If it was the case that when you received the reply packet you could know 
that its state has not yet been synchronized then no problem, but before 
the state has been synchronized you don't know it is reply traffic.

> I think what you describe above needs to be done in the case of response
> latency being lower than update latency, i.e. it's not a bad option. It
> will slow down the setup time but that's only for the first new packet.

Only if you accept sloppy connection tracking without TCP windows etc. 
With netfilter conntrack moving to full tracking it is no longer the case 
and you will need relatively frequent synchronizations during the session, 
not only the first packet.

> The impact on subsequent packets in the flow should be low.

Hopefully.

Regards
Henrik


* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
  2004-09-29 15:02                                             ` Henrik Nordstrom
@ 2004-09-30 12:24                                               ` jamal
  0 siblings, 0 replies; 55+ messages in thread
From: jamal @ 2004-09-30 12:24 UTC (permalink / raw)
  To: Henrik Nordstrom
  Cc: Harald Welte, netfilter-devel, Netfilter-failover list,
	KOVACS Krisztian

On Wed, 2004-09-29 at 11:02, Henrik Nordstrom wrote:
> On Wed, 29 Sep 2004, jamal wrote:

[..]
> If it was the case that when you received the reply packet you could know 
> that its state has not yet been synchronized then no problem, but before 
> the state has been synchronized you don't know it is reply traffic.

Yes.

> > I think what you describe above needs to be done in the case of response
> > latency being lower than update latency, i.e. it's not a bad option. It
> > will slow down the setup time but that's only for the first new packet.
> 
> Only if you accept sloppy connection tracking without TCP windows etc. 
> With netfilter conntrack moving to full tracking this is no longer the case 
> and you will need relatively frequent synchronizations during the session, 
> not only for the first packet.

Hehehe. What is this? A conspiracy to make it harder to sync? ;->
This needs some more thinking.

cheers,
jamal
