* [RFC] ct_sync 0.15 (corrected)
@ 2004-08-13 14:26 KOVACS Krisztian
2004-08-19 11:06 ` Harald Welte
` (3 more replies)
0 siblings, 4 replies; 55+ messages in thread
From: KOVACS Krisztian @ 2004-08-13 14:26 UTC
To: Netfilter-failover list; +Cc: netfilter-devel, jamal
Hi,
[warning: this mail got quite long]
I think finally I managed to get ct_sync in a state where some more
public review/testing would certainly do good for the project. This week
I've been successfully running overnight stress tests on our two-node
cluster, and both of the nodes survived all of the tests without locking
up, panicking, etc. OK, I know this is not too much, but is certainly
better than anything we had up to now, and somehow I felt this would be
the bare minimum I should reach before asking anyone to try out ct_sync.
Of course, there is still a lot to do. Just to mention two things which
certainly need a lot of work: protocol implementation is really the bare
minimum which is capable of doing anything, and expectation support is
missing completely. The current implementation supports replicating
conntrack entries, and supports NAT. So basically all the simple things
(not using expectations) should work reasonably well. This is why I've
bumped the version number to 0.15, it is available in the Netfilter CVS.
The aim of this e-mail is twofold:
  * First, I just thought that ct_sync is now ready for some
    testing. Unfortunately our development and testing environment
    is very limited, so anyone able to help us out and do some
    testing would be a big help.
  * After doing some work to stabilize the current code, there is
    now a need to discuss a few things before going on with the
    implementation. So, I'd be happy if there was some discussion on
    these things before starting to implement anything.
I'll try to summarize the most important open problems I've come
across. This list does not contain things I consider trivial, for
example exporting important internal constants as tunable parameters
through sysctl.
1. There should be some facility by which one can select which
connections have to be replicated. This way it would be possible
to limit replication traffic to the bare minimum. For example,
there is no point in replicating conntrack entries for
connections whose endpoint is one of the nodes (administrative
SSH traffic, for example). A per-conntrack flag would be needed,
just like CONNMARK, which could be set for conntracks needing
replication with a simple iptables rule. Actually, CONNMARK is
enough, if we choose a given bit of the mark as the SYNC bit.
Besides this, we should decide whether we need a SYNC or a NOSYNC
bit, that is, whether the default mode of operation should be "sync"
or "do not sync".
2. The error recovery functions in the protocol layer should be
revamped. The protocol is a plain sequence numbered, NACK-based
one using multicast UDP. Lost packets are detected by the
receiver when the next packet arrives with an unexpected
sequence number (not 'seq of the last packet + 1').
When this occurs, the node sends a recovery request containing
the last successfully received sequence number. When the current
master receives such a request, it should re-send all matching
packets from the backlog of its send ring. However, to iterate
over the entries of the ring, it should hold the spinlock of the
ring, which is not possible, since the send() operation may
sleep... (This is done from the receiver thread, and the ring is
accessed from the sender thread and from softirq context as well.)
What would be the most elegant solution?
On the other hand, it may be possible that the master is not
able to re-send the packet, for example this may be the case if
it is "too old", and is not present in the backlog anymore. In
this case, the slave should be notified that recovery is not
possible this way, and it needs to do a full re-sync. This is
why I thought that we should include some extra information in
every packet: the minimal sequence number of the oldest packet
in the master's backlog. Using this approach upon receiving a
packet with a 'wrong' sequence number the slave can immediately
decide if there is still hope of recovering the missing packets
or not. If not, it requests full-resynchronization instead. I
think this feature could handle the problem of a broken link
between the nodes causing lots of lost packets. Am I missing
something?
The protocol layer is really very dumb in other respects as well.
For example it simply drops all packets considered too fresh
instead of queuing them. (Although I would like to add that
typically there should be no errors on the replication network
at all, except for administrative reasons.) However, I
really don't think we should develop a full-blown protocol,
there is simply no point in creating one more reliable protocol...
So, does anyone know of anything which could be used by ct_sync?
(It has to be a semi-reliable, connectionless multicast protocol
with a _very_ low overhead.)
3. There are a few things in the connection tracking code which
are incompatible with replication "by design". For example,
the expectfn() function in the expectation structure is such:
simply, there is no way to replicate a stand-alone function
pointer which could point to any arbitrary function. One more
example could be TCP window tracking, I don't think we have
the necessary bandwidth and CPU time to send an update message
after each and every received TCP packet... Any idea how we
could solve these problems?
4. The current version is 2.4-only, it is for the good old
ip_conntrack, and supports IPv4 only. I don't really think
this is the way to go, but there is commercial interest in
having this kind of failover functionality as fast as possible.
However, I think that after reaching some state which is
acceptable for the users needing the basic features fast, this
whole thing should be re-designed and ported to 2.6 and
nf_conntrack. This would depend on a few other things, such as
porting ctnetlink for nf_conntrack, but I think those would
be important to have as well. Again, this would be quite a
lot of work to do, thus deferring the 'stable' (production ready)
release of the code.
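
The recovery decision described in item 2 can be sketched as follows. This is only an illustration in Python (not kernel C), and it is not the ct_sync implementation: the names Action, classify and oldest_in_backlog are hypothetical, and sequence number wrap-around is ignored for brevity. It assumes every packet carries the oldest sequence number still present in the master's backlog, as proposed above.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"              # packet is the expected one
    IGNORE = "ignore"              # duplicate or already-seen packet
    REQUEST_RESEND = "resend"      # gap is still in the master's backlog
    REQUEST_FULL_RESYNC = "full"   # gap is older than the backlog


def classify(last_seq: int, pkt_seq: int, oldest_in_backlog: int) -> Action:
    """Decide what a slave does with an incoming packet.

    last_seq          -- last sequence number received in order
    pkt_seq           -- sequence number of the incoming packet
    oldest_in_backlog -- oldest sequence number the master can still
                         re-send (the proposed extra per-packet field)
    """
    expected = last_seq + 1
    if pkt_seq == expected:
        return Action.ACCEPT
    if pkt_seq < expected:
        return Action.IGNORE
    # A gap: packets expected .. pkt_seq - 1 are missing.
    if oldest_in_backlog <= expected:
        return Action.REQUEST_RESEND
    return Action.REQUEST_FULL_RESYNC
```

With this field in every packet the slave can choose between a NACK and a full resync without an extra round trip to the master.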
Wow, this got quite long, thanks for reading all this. :)
And happy testing and reporting back! :)
--
Regards,
Krisztian KOVACS
* Re: [RFC] ct_sync 0.15 (corrected)
2004-08-13 14:26 [RFC] ct_sync 0.15 (corrected) KOVACS Krisztian
@ 2004-08-19 11:06 ` Harald Welte
2004-08-19 12:13 ` KOVACS Krisztian
` (3 more replies)
2004-08-22 0:40 ` Patrick McHardy
` (2 subsequent siblings)
3 siblings, 4 replies; 55+ messages in thread
From: Harald Welte @ 2004-08-19 11:06 UTC
To: KOVACS Krisztian; +Cc: Netfilter-failover list, netfilter-devel, jamal
On Fri, Aug 13, 2004 at 04:26:30PM +0200, KOVACS Krisztian wrote:
> 1. There should some facility by which one can select which
> connections have to be replicated. This way it would be possible
> to limit replication traffic to the bare minimum. For example,
> there is no point in replicating conntrack entries for
> connections whose endpoint is one of the nodes (administrative
> SSH traffic, for example). A per-conntrack flag would be needed,
> just like CONNMARK, which could be set for conntracks needing
> replication with a simple iptables rule. Actually, CONNMARK is
> enough, if we choose a given bit of the mark as the SYNC bit.
> Besides this, we should decide if we needed a SYNC or a NOSYNC
> bit, that is, if the default mode of operation should be "sync
> or not to sync".
I would just use connmark for now. Let's make it a CONFIG option
though, so people can just use connmark without any interference and
replicate all connections.
> 2. The error recovery functions in the protocol layer should be
> revamped.
> However, to iterate over the entries of the ring, it should
> hold the spinlock of the ring, which is not possible, since
> the send() operation may sleep... (This is done from the
> receiver thread, and the ring is accessed from the sender
> thread and from softirq context as well.) What would be the
> most elegant solution?
Given that this is an event expected to happen very rarely, I would
propose to just:
- grab the lock
- copy the whole ring (or the needed parts)
- release the lock
- send packets from the local copy (may sleep)
- free local copy
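
The steps above can be sketched as follows. This is a user-space Python illustration only: threading.Lock stands in for the ring's spinlock, and SendRing, snapshot_after and resend are hypothetical names, not ct_sync functions.

```python
import threading
from collections import deque


class SendRing:
    """Bounded backlog of (seq, payload) pairs guarded by a lock."""

    def __init__(self, size: int) -> None:
        self._lock = threading.Lock()
        self._ring = deque(maxlen=size)  # oldest entries fall off the front

    def push(self, seq: int, payload: bytes) -> None:
        with self._lock:
            self._ring.append((seq, payload))

    def snapshot_after(self, last_seq: int) -> list:
        """Copy the entries newer than last_seq while holding the lock."""
        with self._lock:
            return [(s, p) for (s, p) in self._ring if s > last_seq]


def resend(ring: SendRing, last_seq: int, send) -> int:
    """Re-send everything after last_seq; send() may block."""
    pending = ring.snapshot_after(last_seq)  # lock held only for the copy
    for seq, payload in pending:             # lock is NOT held here
        send(seq, payload)
    return len(pending)                      # local copy is dropped on return
```

The cost is one extra copy of the needed part of the ring per recovery request, which should be acceptable if recovery really is rare.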
> On the other hand, it may be possible that the master is not
> able to re-send the packet, for example this may be the case if
> it is "too old", and is not present in the backlog anymore. In
> this case, the slave should be notified that recovery is not
> possible this way, and it needs to do a full re-sync.
Within the current protocol, the master can just make that decision and
do a full resync without telling the slave.
> This is why I thought that we should include some extra
> information in every packet: the minimal sequence number of
> the oldest packet in the master's backlog.
Agreed. We should also add a read-only sysctl that tells userspace
whether a slave is already fully-synced.
> So, do anyone know of anything which could be used by ct_sync?
> (It has to be a semi-reliable, connectionless multicast protocol
> with a _very_ low overhead.)
Everything I've seen so far about reliable multicast is inherently
complex.
> 3. There are a few things in the connection tracking code which
> are incompatible with replication "by design". For example,
> the expectfn() function in the expectation structure is such:
> simply, there is no way to replicate a stand-alone function
> pointer which could point to any arbitrary function.
Yes, indeed. We could look up the symbol name in the symbol table and
replicate that ;) Crude hack, but it would work.
> One more example could be TCP window tracking, I don't think we
> have the necessary bandwidth and CPU time to send an update
> message after each and every received TCP packet... Any idea
> how we could solve these problems?
We already do this since the timeout is updated with every packet. So
at this point, I don't see much difference. Jozsef and I agreed some time
ago that if we don't replicate all the window information, then in
the event of a slave being promoted to master, the new master should
disable windowtracking or switch into a lazy mode.
> 4. The current version is 2.4-only, it is for the good old
> ip_conntrack, and supports IPv4 only. I don't really think
> this is the way to go, but there is commercial interest in
> having this kind of failover functionality as fast as possible.
Ack.
> However, I think that after reaching some state which is
> acceptable for the users needing the basic features fast, this
> whole thing should be re-designed and ported to 2.6 and
> nf_conntrack. This would depend on a few other things, such as
> porting ctnetlink for nf_conntrack, but I thing those would
> be important to have as well. Again, this would be quite a
> lot work to do, thus deferring the 'stable' (production ready)
> release of the code.
I would first make the 2.4.x version stable and almost feature-complete
(as far as possible). We have then learned our lessons and can clean it
up while porting on top of nf_conntrack.
> Regards,
> Krisztian KOVACS
--
- Harald Welte <laforge@netfilter.org> http://www.netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie
* Re: [RFC] ct_sync 0.15 (corrected)
2004-08-19 11:06 ` Harald Welte
@ 2004-08-19 12:13 ` KOVACS Krisztian
2004-08-26 10:00 ` Jozsef Kadlecsik
2004-08-19 12:13 ` KOVACS Krisztian
` (2 subsequent siblings)
3 siblings, 1 reply; 55+ messages in thread
From: KOVACS Krisztian @ 2004-08-19 12:13 UTC
To: Harald Welte; +Cc: Netfilter-failover list, netfilter-devel
Hi,
On Thu, 2004-08-19 at 13:06, Harald Welte wrote:
> > On the other hand, it may be possible that the master is not
> > able to re-send the packet, for example this may be the case if
> > it is "too old", and is not present in the backlog anymore. In
> > this case, the slave should be notified that recovery is not
> > possible this way, and it needs to do a full re-sync.
>
> Within the current protocol, the master can just make that decision and
> do a full resync without telling the slave.
Indeed. However, I'd like to implement a new state (SLAVE_BROKEN)
later, which could be used to avoid electing a slave as new master if
that node is known to be not in sync. A facility of this kind would help
in early detection of such cases.
> > 3. There are a few things in the connection tracking code which
> > are incompatible with replication "by design". For example,
> > the expectfn() function in the expectation structure is such:
> > simply, there is no way to replicate a stand-alone function
> > pointer which could point to any arbitrary function.
>
> Yes, indeed. we could look up the symbol name in the symbol table and
> replicate that ;) Crude hack, but it would work.
Unfortunately I don't think it would work... AFAIK there is no symbol
table information in 2.4 kernels and usually these functions are
declared as static anyway. Moreover, take a look at the H.323 helper, it
uses this expectfn function to set the helper of the conntrack to an
unregistered helper structure... I think there is no point in making
ct_sync overly complicated; these helpers should be fixed instead. (I
don't know why the H.323 does things this way, but it is completely
hopeless to replicate things like this.)
> > One more example could be TCP window tracking, I don't think we
> > have the necessary bandwidth and CPU time to send an update
> > message after each and every received TCP packet... Any idea
> > how we could solve these problems?
>
> We already do this since the timeout is updated with every packet. So
> at this point, I see not much difference. Jozsef and I agreed some time
> in the past, that if we don't replicate all the window information, in
> the event of a slave being propagated to master, the new master should
> disable windowtracking or switch into a lazy mode.
Indeed, the timeout is updated with every packet. However, we do not
generate an update message for IPCT_REFRESH events at the moment. On the
other hand, we replicate the timeout changes when a state change occurs
as well, so I don't worry about incorrect (relative) timeout values at
all.
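
To illustrate the point, here is a small Python sketch of the filtering described above: refresh-only events produce no update message, while any replicated event carries the current timeout along. Only the IPCT_REFRESH event is named in this thread; the Event members, REPLICATED and build_update are hypothetical names chosen for the example.

```python
from enum import Flag, auto


class Event(Flag):
    NEW = auto()
    STATE_CHANGE = auto()  # hypothetical name for a protocol state change
    REFRESH = auto()       # timeout refreshed by a packet (cf. IPCT_REFRESH)
    DESTROY = auto()


# Events that trigger an update message; REFRESH is deliberately absent.
REPLICATED = Event.NEW | Event.STATE_CHANGE | Event.DESTROY


def build_update(events: Event, state: str, timeout: int):
    """Return an update message, or None if nothing needs to be sent."""
    if not (events & REPLICATED):
        return None
    # The timeout is sent along with every state change.
    return {"state": state, "timeout": timeout}
```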
Anyway, thanks for the comments.
--
Regards,
Krisztian KOVACS
* Re: [RFC] ct_sync 0.15 (corrected)
2004-08-19 12:13 ` KOVACS Krisztian
@ 2004-08-26 10:00 ` Jozsef Kadlecsik
2004-08-26 11:12 ` KOVACS Krisztian
0 siblings, 1 reply; 55+ messages in thread
From: Jozsef Kadlecsik @ 2004-08-26 10:00 UTC
To: KOVACS Krisztian; +Cc: Harald Welte, netfilter-devel, Netfilter-failover list
On Thu, 19 Aug 2004, KOVACS Krisztian wrote:
> > > 3. There are a few things in the connection tracking code which
> > > are incompatible with replication "by design". For example,
> > > the expectfn() function in the expectation structure is such:
> > > simply, there is no way to replicate a stand-alone function
> > > pointer which could point to any arbitrary function.
> >
> > Yes, indeed. we could look up the symbol name in the symbol table and
> > replicate that ;) Crude hack, but it would work.
>
> Unfortunately I don't think it would work... AFAIK there is no symbol
> table information in 2.4 kernels and usually these functions are
> declared as static anyway. Moreover, take a look at the H.323 helper, it
> uses this expectfn function to set the helper of the conntrack to an
> unregistered helper structure... I think there is no point in making
> ct_sync overly complicated; these helpers should be fixed instead. (I
> don't know why the H.323 does things this way, but it is completely
> hopeless to replicate things like this.)
We could add the expect function to the ip_conntrack_helper structure and
identify it by the helper name in the update messages. The unregistered
helper in the H.323 conntrack/nat module could be registered with an
invalid, never matching port and let the expect function handle it as
before (because the real port is dynamic). I think that'd be sufficient in
solving the replication problem.
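
A rough Python sketch of this idea, assuming a registry keyed by helper name. All names here (register_helper, encode_expectation, apply_expectation, the message layout) are hypothetical and only illustrate that the update message carries a name instead of a function pointer.

```python
from typing import Callable, Dict, Optional

ExpectFn = Callable[[dict], None]

# Hypothetical registry: helper name -> expect function (may be None).
_helpers: Dict[str, Optional[ExpectFn]] = {}


def register_helper(name: str, expectfn: Optional[ExpectFn] = None) -> None:
    """Register a helper together with its expect function."""
    _helpers[name] = expectfn


def encode_expectation(helper_name: str, tuple_: str) -> dict:
    """Build the update message: it carries the helper name, never a pointer."""
    return {"helper": helper_name, "tuple": tuple_}


def apply_expectation(msg: dict) -> dict:
    """On the slave: resolve the helper name back to the local function."""
    if msg["helper"] not in _helpers:
        raise KeyError("helper not registered: " + msg["helper"])
    return {"tuple": msg["tuple"], "expectfn": _helpers[msg["helper"]]}
```

This only works if every helper, including the H.323 one, is registered on both nodes under the same name.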
Unfortunately we cannot fix the H.323 protocol. :-)
> > > One more example could be TCP window tracking, I don't think we
> > > have the necessary bandwidth and CPU time to send an update
> > > message after each and every received TCP packet... Any idea
> > > how we could solve these problems?
> >
> > We already do this since the timeout is updated with every packet. So
> > at this point, I see not much difference. Jozsef and I agreed some time
> > in the past, that if we don't replicate all the window information, in
> > the event of a slave being propagated to master, the new master should
> > disable windowtracking or switch into a lazy mode.
>
> Indeed, the timeout is updated with every packet. However, we do not
> generate an update message for IPCT_REFRESH events at the moment. On the
> other hand, we replicate the timeout changes when a state change occurs
> as well, so I don't worry about incorrect (relative) timeout values at
> all.
That's still fine with TCP window tracking. Just as Harald wrote, we can
switch to lazy mode on the slaves. But to be honest, that is the least
verified part of the window tracking code.
Best regards,
Jozsef
-
E-mail : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
H-1525 Budapest 114, POB. 49, Hungary
* Re: [RFC] ct_sync 0.15 (corrected)
2004-08-26 10:00 ` Jozsef Kadlecsik
@ 2004-08-26 11:12 ` KOVACS Krisztian
2004-08-26 11:39 ` Jozsef Kadlecsik
0 siblings, 1 reply; 55+ messages in thread
From: KOVACS Krisztian @ 2004-08-26 11:12 UTC
To: Jozsef Kadlecsik; +Cc: Harald Welte, netfilter-devel, Netfilter-failover list
Hi,
On Thu, 2004-08-26 at 12:00, Jozsef Kadlecsik wrote:
> > Unfortunately I don't think it would work... AFAIK there is no symbol
> > table information in 2.4 kernels and usually these functions are
> > declared as static anyway. Moreover, take a look at the H.323 helper, it
> > uses this expectfn function to set the helper of the conntrack to an
> > unregistered helper structure... I think there is no point in making
> > ct_sync overly complicated; these helpers should be fixed instead. (I
> > don't know why the H.323 does things this way, but it is completely
> > hopeless to replicate things like this.)
>
> We could add the expect function to the ip_conntrack_helper structure and
> identify it by the helper name in the update messages. The unregistered
> helper in the H.323 conntrack/nat module could be registered with an
> invalid, never matching port and let the expect function handle it as
> before (because the real port is dynamic). I think that'd be sufficient in
> solving the replication problem.
Sounds good. This way we could replicate the expectfn function along
with the conntrack helper structure, and the unregistered helpers could
be handled as well. This might be a bit more complicated than
the current solution, but if we have to do some evil magic to handle
H.323, we should do that in a ct_sync compatible manner if possible...
> Unfortunately we cannot fix the H.323 protocol. :-)
Of course, I tried to write 'I don't know why the H.323 _helper_ does
things this way' but made some mistakes... :(
> > Indeed, the timeout is updated with every packet. However, we do not
> > generate an update message for IPCT_REFRESH events at the moment. On the
> > other hand, we replicate the timeout changes when a state change occurs
> > as well, so I don't worry about incorrect (relative) timeout values at
> > all.
>
> That's still fine with TCP window tracking. Just as Harald wrote, we can
> switch to lazy mode on the slaves. But to be honest, that is the least
> verified part of the window tracking code.
This would be an easy solution.
--
Regards,
Krisztian KOVACS
* Re: [RFC] ct_sync 0.15 (corrected)
2004-08-26 11:12 ` KOVACS Krisztian
@ 2004-08-26 11:39 ` Jozsef Kadlecsik
2004-08-26 16:14 ` [nf-failover] " KOVACS Krisztian
0 siblings, 1 reply; 55+ messages in thread
From: Jozsef Kadlecsik @ 2004-08-26 11:39 UTC
To: KOVACS Krisztian; +Cc: Harald Welte, netfilter-devel, Netfilter-failover list
Hi,
On Thu, 26 Aug 2004, KOVACS Krisztian wrote:
> > We could add the expect function to the ip_conntrack_helper structure and
> > identify it by the helper name in the update messages. The unregistered
> > helper in the H.323 conntrack/nat module could be registered with an
> > invalid, never matching port and let the expect function handle it as
> > before (because the real port is dynamic). I think that'd be sufficient in
> > solving the replication problem.
>
> Sounds good. This way the could replicate the expectfn function along
> with the conntrack helper structure, and the unregistered helpers could
> be handled as well. Although this might be a bit more complicated than
> the current solution, but if we have to do some evil magic to handle
> H.323, we should do that in a ct_sync compatible manner if possible...
Because the so far unregistered H.323 helper would be registered, that would
be fully ct_sync compatible, without the need to modify anything in
ct_sync. The core/ct_sync should be modified for the expectfn only, and
that's a general requirement, independent of the H.323 helper.
Best regards,
Jozsef
-
E-mail : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
H-1525 Budapest 114, POB. 49, Hungary
* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
2004-08-26 11:39 ` Jozsef Kadlecsik
@ 2004-08-26 16:14 ` KOVACS Krisztian
0 siblings, 0 replies; 55+ messages in thread
From: KOVACS Krisztian @ 2004-08-26 16:14 UTC
To: Jozsef Kadlecsik; +Cc: Harald Welte, netfilter-devel, Netfilter-failover list
Hi,
On Thu, Aug 26, 2004 at 01:39:33PM +0200, Jozsef Kadlecsik wrote:
> Because the so far unregistered H.323 helper were registered, that would
> be fully ct_sync compatible, without the need to modify anyting in
> ct_sync. The core/ct_sync should be modified for the expectn only and
> that's a general requirement, independent of the H.323 helper.
In fact, ct_sync would not require any modifications, so I'd prefer this
solution.
--
KOVACS Krisztian
* Re: [RFC] ct_sync 0.15 (corrected)
2004-08-19 11:06 ` Harald Welte
2004-08-19 12:13 ` KOVACS Krisztian
2004-08-19 12:13 ` KOVACS Krisztian
@ 2004-08-19 16:13 ` Henrik Nordstrom
2004-08-22 20:43 ` KOVACS Krisztian
3 siblings, 0 replies; 55+ messages in thread
From: Henrik Nordstrom @ 2004-08-19 16:13 UTC
To: Harald Welte
Cc: KOVACS Krisztian, Netfilter-failover list, netfilter-devel, jamal
On Thu, 19 Aug 2004, Harald Welte wrote:
> I would just use connmark for now. Let's make it a CONFIG option
> though, so people can just use connmark without any interference and
> replicate all connections.
With the (conn)mark operations patch discussed recently on netfilter-devel
it makes sense to use a bitmask for this, masking which bit(s) indicate a
session should be replicated.
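
A minimal Python sketch of such a mask test. The mask value and the names SYNC_MASK and should_replicate are assumptions made for this example; the thread has not decided which bit to use, nor whether it should mean SYNC or NOSYNC, so both modes are shown.

```python
SYNC_MASK = 0x80000000  # hypothetical: top bit of the 32-bit connmark


def should_replicate(connmark: int, mask: int = SYNC_MASK,
                     default_sync: bool = False) -> bool:
    """Return True if a conntrack entry with this mark should be replicated.

    default_sync=False: the masked bit(s) act as a SYNC flag (opt in).
    default_sync=True:  the masked bit(s) act as a NOSYNC flag (opt out).
    """
    flagged = (connmark & mask) != 0
    return not flagged if default_sync else flagged
```

The remaining mark bits stay free for other uses, which is the point of masking rather than comparing the whole mark.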
Regards
Henrik
* Re: [RFC] ct_sync 0.15 (corrected)
2004-08-19 11:06 ` Harald Welte
` (2 preceding siblings ...)
2004-08-19 16:13 ` Henrik Nordstrom
@ 2004-08-22 20:43 ` KOVACS Krisztian
2004-08-24 18:37 ` Harald Welte
3 siblings, 1 reply; 55+ messages in thread
From: KOVACS Krisztian @ 2004-08-22 20:43 UTC
To: Harald Welte, KOVACS Krisztian, Netfilter-failover list,
netfilter-devel, jamal
Hi,
On Thu, Aug 19, 2004 at 01:06:46PM +0200, Harald Welte wrote:
> > So, do anyone know of anything which could be used by ct_sync?
> > (It has to be a semi-reliable, connectionless multicast protocol
> > with a _very_ low overhead.)
>
> everything I've seen so far about reliable multicast is inherently
> complex.
Oops, I've just found TIPC. Does anyone know enough details of TIPC to
judge if its reliable multicast service would be useful for us? I've just
downloaded the IETF draft, and it seems to me that the reliable multicast
service provided by TIPC may be useful (section 2.9 of the draft). Any
ideas?
The SourceForge URL:
http://sourceforge.net/projects/tipc/
--
KOVACS Krisztian
* Re: [RFC] ct_sync 0.15 (corrected)
2004-08-22 20:43 ` KOVACS Krisztian
@ 2004-08-24 18:37 ` Harald Welte
2004-08-25 11:41 ` jamal
0 siblings, 1 reply; 55+ messages in thread
From: Harald Welte @ 2004-08-24 18:37 UTC
To: KOVACS Krisztian, Netfilter-failover list, netfilter-devel, jamal
On Sun, Aug 22, 2004 at 10:43:26PM +0200, KOVACS Krisztian wrote:
>
> Hi,
>
> On Thu, Aug 19, 2004 at 01:06:46PM +0200, Harald Welte wrote:
> > > So, do anyone know of anything which could be used by ct_sync?
> > > (It has to be a semi-reliable, connectionless multicast protocol
> > > with a _very_ low overhead.)
> >
> > everything I've seen so far about reliable multicast is inherently
> > complex.
>
> Oops, I've just found TIPC. Does anyone know enough details of TIPC to
> judge if its reliable multicast service would be useful for us? I've just
> downloaded the IETF draft, and it seems to me that the reliable multicast
> service provided by TIPC may be useful (section 2.9 of the draft). Any
> ideas?
Unfortunately I only learned about TIPC recently. But looking at the
IETF draft and the current implementation, I think it is probably too
expensive. Another argument is to not base ct_sync on something big
outside of the official kernel tree that is not under our control..
--
- Harald Welte <laforge@netfilter.org> http://www.netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie
* Re: [RFC] ct_sync 0.15 (corrected)
2004-08-24 18:37 ` Harald Welte
@ 2004-08-25 11:41 ` jamal
0 siblings, 0 replies; 55+ messages in thread
From: jamal @ 2004-08-25 11:41 UTC
To: Harald Welte; +Cc: netfilter-devel, Netfilter-failover list, KOVACS Krisztian
As Harald says, it's a little heavyweight for what you guys need to do.
And it doesn't run on IP proper.
We are going to explore and probably use TIPC for ForCES as one of the
options for transport in the case of non-IP.
For IP we are thinking of a dual transport approach: two sockets, one
UDP/multicast while the other is TCP|SCTP/unicast reliable.
Send always on UDP multicast; respond always on multicast. Reap the
benefits of multicast. Do NOT retransmit on multicast, retransmits of
either queries/updates etc are done over unicast.
This simple technique is borrowed from OSPF.
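
A toy Python model of that policy, with in-memory lists standing in for the two sockets. DualTransport and its methods are hypothetical names for illustration; no real socket API is used, and this is a sketch of the rule only, not of ForCES or OSPF.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DualTransport:
    """Multicast for first transmissions, reliable unicast for retransmits."""
    multicast_log: List[Tuple[int, bytes]] = field(default_factory=list)
    unicast_log: List[Tuple[str, int, bytes]] = field(default_factory=list)
    sent: Dict[int, bytes] = field(default_factory=dict)

    def publish(self, seq: int, payload: bytes) -> None:
        # First transmission always goes to the multicast group.
        self.sent[seq] = payload
        self.multicast_log.append((seq, payload))

    def retransmit(self, peer: str, seq: int) -> bool:
        # Retransmits never touch multicast; only the asking peer gets them.
        if seq not in self.sent:
            return False
        self.unicast_log.append((peer, seq, self.sent[seq]))
        return True
```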
If you are interested we can discuss this more.
cheers,
jamal
On Tue, 2004-08-24 at 14:37, Harald Welte wrote:
> On Sun, Aug 22, 2004 at 10:43:26PM +0200, KOVACS Krisztian wrote:
> >
> > Hi,
> >
> > On Thu, Aug 19, 2004 at 01:06:46PM +0200, Harald Welte wrote:
> > > > So, do anyone know of anything which could be used by ct_sync?
> > > > (It has to be a semi-reliable, connectionless multicast protocol
> > > > with a _very_ low overhead.)
> > >
> > > everything I've seen so far about reliable multicast is inherently
> > > complex.
> >
> > Oops, I've just found TIPC. Does anyone know enough details of TIPC to
> > judge if its reliable multicast service would be useful for us? I've just
> > downloaded the IETF draft, and it seems to me that the reliable multicast
> > service provided by TIPC may be useful (section 2.9 of the draft). Any
> > ideas?
>
> Unfortunately I only learned about TIPC recently. But looking at the
> IETF draft and the current implementation, I think it is probably too
> expensive. Another argument is to not base ct_sync on something big
> outside of the official kernel tree that is not under our control.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] ct_sync 0.15 (corrected)
2004-08-13 14:26 [RFC] ct_sync 0.15 (corrected) KOVACS Krisztian
2004-08-19 11:06 ` Harald Welte
@ 2004-08-22 0:40 ` Patrick McHardy
2004-08-22 7:49 ` [nf-failover] " KOVACS Krisztian
2004-09-02 5:10 ` Willy Tarreau
2004-09-24 2:42 ` jamal
3 siblings, 1 reply; 55+ messages in thread
From: Patrick McHardy @ 2004-08-22 0:40 UTC (permalink / raw)
To: KOVACS Krisztian; +Cc: Netfilter-failover list, netfilter-devel, jamal
Hi Krisztian,
KOVACS Krisztian wrote:
> 4. The current version is 2.4-only, it is for the good old
> ip_conntrack, and supports IPv4 only. I don't really think
> this is the way to go, but there is commercial interest in
> having this kind of failover functionality as fast as possible.
> However, I think that after reaching some state which is
> acceptable for the users needing the basic features fast, this
> whole thing should be re-designed and ported to 2.6 and
> nf_conntrack. This would depend on a few other things, such as
> porting ctnetlink for nf_conntrack, but I think those would
> be important to have as well. Again, this would be quite a
> lot of work to do, thus deferring the 'stable' (production ready)
> release of the code.
>
Are there any differences between the nfnetlink-ctnetlink patch and the
ctnetlink patch in the netfilter-ha repository ? Porting ctnetlink to
2.6 would be a start. Maybe someone wants to do it, otherwise I'll do
it on a rainy day ..
Regards
Patrick
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
2004-08-22 0:40 ` Patrick McHardy
@ 2004-08-22 7:49 ` KOVACS Krisztian
2004-08-22 20:42 ` Sven Schuster
0 siblings, 1 reply; 55+ messages in thread
From: KOVACS Krisztian @ 2004-08-22 7:49 UTC (permalink / raw)
To: Patrick McHardy
Cc: KOVACS Krisztian, Netfilter-failover list, netfilter-devel
Hi,
Patrick McHardy wrote:
> Are there any differences between the nfnetlink-ctnetlink patch and the
> ctnetlink patch in the netfilter-ha repository ? Porting ctnetlink to
> 2.6 would be a start. Maybe someone wants to do it, otherwise I'll do
> it on a rainy day ..
No, there is no difference apart from a small change I've already
mentioned on netfilter-devel wrt the NATINFO notification. For details, see
http://lists.netfilter.org/pipermail/netfilter-devel/2004-August/016225.html
AFAIK the ctnetlink patch has already been ported to 2.6, and a patch
was sent to the mailing list. Unfortunately I was unable to find that
mail in the mailing list archives...
--
KOVÁCS, Krisztián
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
2004-08-22 7:49 ` [nf-failover] " KOVACS Krisztian
@ 2004-08-22 20:42 ` Sven Schuster
2004-08-23 9:51 ` Patrick McHardy
0 siblings, 1 reply; 55+ messages in thread
From: Sven Schuster @ 2004-08-22 20:42 UTC (permalink / raw)
To: KOVACS Krisztian
Cc: Patrick McHardy, Netfilter-failover list, netfilter-devel
Hi Krisztian, hi Patrick,
On Sun, Aug 22, 2004 at 09:49:26AM +0200, KOVACS Krisztian told us:
> AFAIK the ctnetlink patch has already been ported to 2.6, and a patch
> was sent to the mailing list. Unfortunately I was unable to find that
> mail in the mailing list archives...
look at this one :)
http://marc.theaimsgroup.com/?l=netfilter-devel&m=109154590603639&w=2
Still didn't find enough time to port it to a current 2.6 kernel. But
HTH!
Sven
--
Linux zion 2.6.8-rc2 #1 Sun Jul 18 15:00:48 CEST 2004 i686 athlon i386 GNU/Linux
22:38:58 up 35 days, 7 min, 1 user, load average: 0.00, 0.02, 0.00
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] ct_sync 0.15 (corrected)
2004-08-13 14:26 [RFC] ct_sync 0.15 (corrected) KOVACS Krisztian
2004-08-19 11:06 ` Harald Welte
2004-08-22 0:40 ` Patrick McHardy
@ 2004-09-02 5:10 ` Willy Tarreau
2004-09-02 12:39 ` KOVACS Krisztian
2004-09-24 2:42 ` jamal
3 siblings, 1 reply; 55+ messages in thread
From: Willy Tarreau @ 2004-09-02 5:10 UTC (permalink / raw)
To: KOVACS Krisztian; +Cc: netfilter-devel, jamal, Netfilter-failover list
Hi,
> One more
> example could be TCP window tracking, I don't think we have
> the necessary bandwidth and CPU time to send an update message
> after each and every received TCP packet... Any idea how we
> could solve these problems?
I don't really agree on the bandwidth argument : a small TCP packet with
only an ACK or a control flag is 40 bytes. This is 12 bytes more than the
smallest UDP packet, so if you can update a connection with at most 12 bytes,
you can use links of the same nature. BTW, I don't think you should send one
packet per connection. You should queue updates into a list, and build the
update packet from this list. This way, you eliminate the IP+UDP header (28
bytes) for all updates except the first one, which means you then have 40
bytes to update a connection without using more bandwidth. I'm not saying
this is enough for every case, but I think that depending on the type
of message (creation, update, destruction), we might do things with this.
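The arithmetic above can be checked with a few lines of Python. This is only a sketch: the header sizes assume IPv4 and TCP without options, and the function name is made up for this example.

```python
IP_HEADER = 20       # bytes, IPv4 without options
TCP_HEADER = 20      # bytes, TCP without options
UDP_HEADER = 8       # bytes

smallest_tcp = IP_HEADER + TCP_HEADER    # bare ACK or control segment
smallest_udp = IP_HEADER + UDP_HEADER    # UDP datagram with an empty payload


def sync_bytes_per_update(update_size, updates_per_packet):
    """Average sync traffic per connection update when updates are batched."""
    packet = IP_HEADER + UDP_HEADER + update_size * updates_per_packet
    return packet / updates_per_packet
```

With one 12-byte update per packet the sync packet is exactly as large as the smallest TCP packet. With batching, the 28-byte IP+UDP header is paid once per packet, so the per-update cost approaches the size of the update itself as more updates share a packet.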
If you don't want to synchronize TCP windows, you might also put the slave
into a "lazy mode" for existing connections when it becomes master. This is a
bit dirty, but might be an acceptable trade-off.
Regards,
Willy
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] ct_sync 0.15 (corrected)
2004-09-02 5:10 ` Willy Tarreau
@ 2004-09-02 12:39 ` KOVACS Krisztian
0 siblings, 0 replies; 55+ messages in thread
From: KOVACS Krisztian @ 2004-09-02 12:39 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netfilter-devel, jamal, Netfilter-failover list
Hi,
On Thu, 2004-09-02 at 07:10, Willy Tarreau wrote:
> > One more
> > example could be TCP window tracking, I don't think we have
> > the necessary bandwidth and CPU time to send an update message
> > after each and every received TCP packet... Any idea how we
> > could solve these problems?
>
> I don't really agree on the bandwidth argument : a small TCP packet with
> only an ACK or a control flag is 40 bytes. This is 12 bytes more than the
> smallest UDP packet, so if you can update a connection with at most 12 bytes,
> you can use links of the same nature. BTW, I don't think you should send one
> packet per connection. You should queue updates into a list, and build the
> update packet from this list. This way, you eliminate the IP+UDP header (28
> bytes) for all updates except the first one, which means you then have 40
> bytes to update a connection without using more bandwidth. I'm not saying
> this is enough for every case, but I think that depending on the type
> of message (creation, update, destruction), we might do things with this.
We already have such a queuing facility: a packet is sent only on
timeout (2 s) or when it is full. The problem is the size of the
messages. At the moment, ct_sync always sends completely self-contained
updates: that is, a single update message contains every bit of
information about a conntrack entry. As a result, the size of an
update message is 240 bytes (+ 4 bytes for the message header), so
you can have only about five update messages per packet, which is
not too much... (There are plans to implement a more fine-grained update
mechanism, but I don't know who will have the time to implement that.)
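The flush rule described above (send when the packet is full or when the timeout expires) could look roughly like the following Python sketch. It is not the ct_sync implementation: the class, its method names, and the explicit `now` argument are assumptions chosen to keep the example self-contained; only the 2-second timeout and the message size come from the text.

```python
UPDATE_SIZE = 240 + 4        # bytes per update: body plus message header (from the text)
FLUSH_TIMEOUT = 2.0          # seconds (from the text)


class UpdateQueue:
    def __init__(self, capacity, send):
        self.capacity = capacity     # updates per packet; a parameter in this sketch
        self.send = send             # callback that receives a list of updates
        self.pending = []
        self.oldest = None           # time at which the first pending update was queued

    def add(self, update, now):
        if not self.pending:
            self.oldest = now
        self.pending.append(update)
        if len(self.pending) >= self.capacity:
            self.flush()             # packet is full

    def tick(self, now):
        """Call periodically; flushes a partially filled packet after the timeout."""
        if self.pending and now - self.oldest >= FLUSH_TIMEOUT:
            self.flush()

    def flush(self):
        self.send(self.pending)
        self.pending = []
        self.oldest = None
```

A queue created as `UpdateQueue(5, send_packet)` (with a hypothetical `send_packet` callback) would then emit one packet per five updates under load, and one packet every two seconds at most when traffic is light.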
> If you don't want to synchronize TCP windows, you might also put the slave
> into a "lazy mode" for existing connections when it becomes master. This is a
> bit dirty, but might be an acceptable trade-off.
Yes, this is exactly what Jozsef suggested.
--
Regards,
Krisztian KOVACS
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] ct_sync 0.15 (corrected)
2004-08-13 14:26 [RFC] ct_sync 0.15 (corrected) KOVACS Krisztian
` (2 preceding siblings ...)
2004-09-02 5:10 ` Willy Tarreau
@ 2004-09-24 2:42 ` jamal
2004-09-25 7:52 ` [nf-failover] " Harald Welte
3 siblings, 1 reply; 55+ messages in thread
From: jamal @ 2004-09-24 2:42 UTC (permalink / raw)
To: KOVACS Krisztian; +Cc: netfilter-devel, Netfilter-failover list
Hi Krisztian,
I just glanced over your code (30 sec scan) and your state machine
doesnt allow for active/active (i.e two masters).
I havent actually run it - can you confirm this is impossible?
if ct_sync was blind i.e it just did what it was told "become master" or
"become slave" regardless of who else is master, then it would be more
usable - leave policy to whatever tells it to switch.
cheers,
jamal
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
2004-09-24 2:42 ` jamal
@ 2004-09-25 7:52 ` Harald Welte
2004-09-27 13:07 ` jamal
0 siblings, 1 reply; 55+ messages in thread
From: Harald Welte @ 2004-09-25 7:52 UTC (permalink / raw)
To: jamal; +Cc: netfilter-devel, Netfilter-failover list, KOVACS Krisztian
On Thu, Sep 23, 2004 at 10:42:19PM -0400, jamal wrote:
> Hi Krisztian,
>
> I just glanced over your code (30 sec scan) and your state machine
> doesnt allow for active/active (i.e two masters).
yes, this is not a supported mode of operation in this first
implementation.
> I havent actually run it - can you confirm this is impossible?
> if ct_sync was blind i.e it just did what it was told "become master" or
> "become slave" regardless of who else is master, then it would be more
> usable - leave policy to whatever tells it to switch.
well, it does exactly this, with an additional safeguard: a master will
be downgraded to slave as soon as another master announces itself. This
is a safeguard against an invalid mode of operation.
> cheers,
> jamal
--
- Harald Welte <laforge@netfilter.org> http://www.netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
2004-09-25 7:52 ` [nf-failover] " Harald Welte
@ 2004-09-27 13:07 ` jamal
2004-09-27 13:30 ` KOVACS Krisztian
2004-09-27 13:39 ` Harald Welte
0 siblings, 2 replies; 55+ messages in thread
From: jamal @ 2004-09-27 13:07 UTC (permalink / raw)
To: Harald Welte; +Cc: netfilter-devel, Netfilter-failover list, KOVACS Krisztian
Hi Harald,
On Sat, 2004-09-25 at 03:52, Harald Welte wrote:
> > I havent actually run it - can you confirm this is impossible?
> > if ct_sync was blind i.e it just did what it was told "become master" or
> > "become slave" regardless of who else is master, then it would be more
> > usable - leave policy to whatever tells it to switch.
>
> well, it does exactly this, with an additional safeguard: a master will
> be downgraded to slave as soon as another master announces itself. This
> is a safeguard against an invalid mode of operation.
I think it would be better to separate the election process (who is
master) from the syncing code. For some reason i thought this separation
was there (and that all you had to do was bag some /proc entry).
i.e if VRRP is the code that makes the decision that it wants you to be
the master, thats how you become master. If someothericandohabetter (eg
forCES) protocol wants you to be the master, thats how you become the
master. There is no point in inventing a new HA scheme.
And if you do, it should probably be in user space (there are no
performance issues with it), i.e. it's such a protocol (in user space) that
should tell your syncing code to send syncs or not.
Unless i missed something fundamental.
cheers,
jamal
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
2004-09-27 13:07 ` jamal
@ 2004-09-27 13:30 ` KOVACS Krisztian
2004-09-27 13:39 ` Harald Welte
1 sibling, 0 replies; 55+ messages in thread
From: KOVACS Krisztian @ 2004-09-27 13:30 UTC (permalink / raw)
To: hadi; +Cc: Harald Welte, netfilter-devel, Netfilter-failover list
Hi,
On Mon, 2004-09-27 at 15:07, jamal wrote:
> > well, it does exactly this, with an additional safeguard: a master will
> > be downgraded to slave as soon as another master announces itself. This
> > is a safeguard against an invalid mode of operation.
>
> I think it would be better to separate the election process (who is
> master) from the syncing code. For some reason i thought this separation
> was there (and that all you had to do was bag some /proc entry).
> i.e if VRRP is the code that makes the decision that it wants you to be
> the master, thats how you become master. If someothericandohabetter (eg
> forCES) protocol wants you to be the master, thats how you become the
> master. There is no point in inventing a new HA scheme.
> And if you do, it should probably be in user space (there are no
> performance issues with it), i.e. it's such a protocol (in user space) that
> should tell your syncing code to send syncs or not.
> Unless i missed something fundamental.
The election process is completely independent, as Harald already
mentioned. The procfs interface is provided so that ct_sync can be used
with any other cluster manager/failover daemon.
However, ct_sync is not capable of load balancing at the moment, and
not just because the protocol has some things which are single-master
specific. The main problem is with NAT and preserving the uniqueness of
tuples in the whole cluster, and unfortunately this would make a lot of
things much more complicated. So, even if the protocol would be
completely multi-master compatible, ct_sync would only be capable of
single-master operation.
You're right that the protocol itself was designed with failover in
mind, and it won't support load balancing clusters without modification.
This is mainly because of the sequence numbers, and could be easily
corrected by maintaining a per-node seqno state on each node. The
safeguard against multiple masters was implemented just to make
sure that we won't have multiple masters even if some administrator is
experimenting with ct_sync and the proc interface.
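The per-node sequence number state mentioned above could be kept roughly as in this Python sketch. The class and method names are hypothetical, node identifiers are arbitrary hashable values, and sequence number wraparound is deliberately ignored to keep the example short.

```python
class SeqTracker:
    """Tracks the last sequence number accepted from each sending node."""

    def __init__(self):
        self.last_seen = {}            # node id -> last accepted seqno

    def accept(self, node_id, seqno):
        """Return True if the message is new for this sender, False if stale or duplicate."""
        last = self.last_seen.get(node_id)
        if last is not None and seqno <= last:
            return False
        self.last_seen[node_id] = seqno
        return True
```

Because each sender has its own counter, messages from two masters with overlapping sequence numbers do not invalidate each other, which a single shared counter would not allow.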
I don't consider these issues fundamental flaws; we tried to make
the election process as independent from ct_sync as possible.
--
Regards,
Krisztian KOVACS
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
2004-09-27 13:07 ` jamal
2004-09-27 13:30 ` KOVACS Krisztian
@ 2004-09-27 13:39 ` Harald Welte
2004-09-28 2:41 ` jamal
1 sibling, 1 reply; 55+ messages in thread
From: Harald Welte @ 2004-09-27 13:39 UTC (permalink / raw)
To: jamal; +Cc: netfilter-devel, Netfilter-failover list, KOVACS Krisztian
On Mon, Sep 27, 2004 at 09:07:53AM -0400, jamal wrote:
> I think it would be better to separate the election process (who is
> master) from the syncing code. For some reason i thought this separation
> was there (and that all you had to do was bag some /proc entry).
> i.e if VRRP is the code that makes the decision that it wants you to be
> the master, thats how you become master. If someothericandohabetter (eg
> forCES) protocol wants you to be the master, thats how you become the
> master. There is no point in inventing a new HA scheme.
> And if you do, it should probably be in user space (there are no
> performance issues with it), i.e. it's such a protocol (in user space) that
> should tell your syncing code to send syncs or not.
> Unless i missed something fundamental.
I totally agree with you, jamal. And in fact this is exactly what we
have. You tell one box it is master, and it becomes master. You tell
a box it is slave, and it becomes slave.
There is just a minor addition in one case, where we want to safeguard
against a (currently) invalid mode of operation. As soon as ct_sync
supports multiple masters, this safeguard will certainly be removed.
But for the current code, unless somebody shows me that it severely
limits some use of ct_sync, or it causes practical problems, I don't see
why we should remove this safeguard.
> cheers,
> jamal
--
- Harald Welte <laforge@netfilter.org> http://www.netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
2004-09-27 13:39 ` Harald Welte
@ 2004-09-28 2:41 ` jamal
2004-09-28 6:46 ` Henrik Nordstrom
0 siblings, 1 reply; 55+ messages in thread
From: jamal @ 2004-09-28 2:41 UTC (permalink / raw)
To: Harald Welte; +Cc: netfilter-devel, Netfilter-failover list, KOVACS Krisztian
Harald,
On Mon, 2004-09-27 at 09:39, Harald Welte wrote:
> I totally agree with you, jamal. And in fact this is exactly what we
> have. You tell one box it is master, and it becomes master. You tell
> a box it is slave, and it becomes slave.
>
> There is just a minor addition in one case, where we want to safeguard
> against a (currently) invalid mode of operation. As soon as ct_sync
> supports multiple masters, this safeguard will certainly be removed.
So if I understood correctly the issue is as described by Krisztian:
On Mon, 2004-09-27 at 09:30, KOVACS Krisztian wrote:
> The main problem is with NAT and preserving the uniqueness of
> tuples in the whole cluster, and unfortunately this would make a lot of
> things much more complicated. So, even if the protocol would be
> completely multi-master compatible, ct_sync would only be capable of
> single-master operation.
If you have two machines A and B, assuming they are symmetric (exactly
the same internal and external IPs), then I should be able to send state from
A->B and B->A and have both B and A updated (if such state doesn't exist).
With the above, if I decide I want to have two nodes as master, they both
generate and accept state update messages.
> But for the current code, unless somebody shows me that it severely
> limits some use of ct_sync, or it causes practical problems, I don't see
> why we should remove this safeguard.
As i see it you need 3 states:
Master - accepts and generates sync messages
slave - only accepts syncs
init - unknown; does neither
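The three states, together with the safeguard discussed earlier in the thread, can be written down as a small Python sketch. This is an illustration only, not the ct_sync state machine: the class and method names are invented, and the `safeguard` flag is an assumption that makes the difference between the two positions visible.

```python
from enum import Enum


class Role(Enum):
    INIT = "init"        # unknown; neither sends nor accepts syncs
    SLAVE = "slave"      # only accepts syncs
    MASTER = "master"    # accepts and generates syncs


class SyncNode:
    def __init__(self, safeguard=True):
        self.role = Role.INIT
        self.safeguard = safeguard     # downgrade when another master announces itself

    def set_role(self, role):
        """Blindly obey whatever external cluster manager sets the role."""
        self.role = role

    def generates_syncs(self):
        return self.role is Role.MASTER

    def accepts_syncs(self):
        return self.role in (Role.MASTER, Role.SLAVE)

    def on_master_announce(self):
        """Called when another node announces itself as master."""
        if self.safeguard and self.role is Role.MASTER:
            self.role = Role.SLAVE
```

With `safeguard=False` two nodes can both be told to become master and both stay master, which is the master/master setup argued for here; with the default the second announcement demotes the first master.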
If you didn't have this safeguard then I should be able to achieve
master/master on two nodes (even if for starters I assume a symmetric
setup). ct_sync in itself should not attempt to be too smart and have a
built-in protocol IMO - there is no point in reinventing the wheel;
people have spent years researching HA protocols, good idea to just use
that.
cheers,
jamal
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
2004-09-28 2:41 ` jamal
@ 2004-09-28 6:46 ` Henrik Nordstrom
2004-09-28 10:56 ` jamal
2004-09-28 11:58 ` Tobias DiPasquale
0 siblings, 2 replies; 55+ messages in thread
From: Henrik Nordstrom @ 2004-09-28 6:46 UTC (permalink / raw)
To: jamal
Cc: Harald Welte, netfilter-devel, Netfilter-failover list,
KOVACS Krisztian
On Tue, 27 Sep 2004, jamal wrote:
> On Mon, 2004-09-27 at 09:30, KOVACS Krisztian wrote:
>> The main problem is with NAT and preserving the uniqueness of
>> tuples in the whole cluster, and unfortunately this would make a lot of
>> things much more complicated. So, even if the protocol would be
>> completely multi-master compatible, ct_sync would only be capable of
>> single-master operation.
>
> If you have two machines A and B, assuming they are symmetric (exactly
> same internal and external IPs) then i should be able to send state from
> A->B and B->A and have both B and A updated (if such state doesnt exist).
No, this is about a different issue entirely.
Let's assume you have two Active-Active gateways G and H, two clients A and
B and one server S. On the gateway NAT is used to masquerade all traffic
to a single external IP address.
Due to the Active-Active setup traffic from A goes via the gateway G and
traffic from B goes via H.
Now you have a SYN from A,31285 to S,80 and also a SYN sent by B,31285 to
S,80. You then end up with two identical NAT assignments and the two
connections will conflict with each other.
> With above if i decide i want to have two nodes as master, they both
> generate and accept state update messages.
Which in itself is not an issue, but the issue is how to ensure these
updates do not conflict with each other as in the example above.
The active-active or active-backup aspect of the synchronization protocol
is trivial. How to ensure there won't be serious session conflicts between
the connection information of the two gateways is the tricky part in order
to be able to provide active-active configurations.
Regards
Henrik
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
2004-09-28 6:46 ` Henrik Nordstrom
@ 2004-09-28 10:56 ` jamal
2004-09-28 12:24 ` KOVACS Krisztian
2004-09-28 11:58 ` Tobias DiPasquale
1 sibling, 1 reply; 55+ messages in thread
From: jamal @ 2004-09-28 10:56 UTC (permalink / raw)
To: Henrik Nordstrom
Cc: Harald Welte, netfilter-devel, Netfilter-failover list,
KOVACS Krisztian
On Tue, 2004-09-28 at 02:46, Henrik Nordstrom wrote:
> Let's assume you have two Active-Active gateways G and H, two clients A and
> B and one server S. On the gateway NAT is used to masquerade all traffic
> to a single external IP address.
>
> Due to the Active-Active setup traffic from A goes via the gateway G and
> traffic from B goes via H.
>
> Now you have a SYN from A,31285 to S,80 and also a SYN sent by B,31285 to
> S,80. You then end up with two identical NAT assignments and the two
> connections will conflict with each other.
If we even look at the 5-tuple {srcIP, dstIP, proto, srcport, dstport} we
already have a distinction, no?
i.e. in the example you provide srcIP would be different.
I can see an issue if those 5-tuples match and you have to find
something else to distinguish them, since Linux conntrack doesn't keep
track of TCP sequence numbers and window dilations. If it did I don't see
why this would be a problem. I think I am having a hard time visualizing
when you would even need to kick in sequence number checks.
> > With above if i decide i want to have two nodes as master, they both
> > generate and accept state update messages.
>
> Which in itself is not an issue, but the issue is how to ensure these
> updates do not conflict with each other as in the example above.
>
> The active-active or active-backup aspect of the synchronization protocol
> is trivial. How to ensure there won't be serious session conflicts between
> the connection information of the two gateways is the tricky part in order
> to be able to provide active-active configurations.
See my comment above.
I dont think i see a serious issue of conflict. I may be missing
something of course. At least the srcIP may endup being a tiebreaker.
To get exactly the same 5 tuples from the same physical machine for a
different flow is impossible i would think.
Again i may be missing something.
cheers,
jamal
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
2004-09-28 10:56 ` jamal
@ 2004-09-28 12:24 ` KOVACS Krisztian
2004-09-28 12:35 ` Henrik Nordstrom
0 siblings, 1 reply; 55+ messages in thread
From: KOVACS Krisztian @ 2004-09-28 12:24 UTC (permalink / raw)
To: hadi
Cc: Harald Welte, netfilter-devel, Netfilter-failover list,
Henrik Nordstrom
Hi,
On Tue, 2004-09-28 at 12:56, jamal wrote:
> On Tue, 2004-09-28 at 02:46, Henrik Nordstrom wrote:
>
> > Let's assume you have two Active-Active gateways G and H, two clients A and
> > B and one server S. On the gateway NAT is used to masquerade all traffic
> > to a single external IP address.
> >
> > Due to the Active-Active setup traffic from A goes via the gateway G and
> > traffic from B goes via H.
> >
> > Now you have a SYN from A,31285 to S,80 and also a SYN sent by B,31285 to
> > S,80. You then end up with two identical NAT assignments and the two
> > connections will conflict with each other.
>
> if we even look at the 5 tuples {srcIP,DstIP, proto, srcport,dstport} we
> already have a distinction, no?
> i.e in the example you provide srcIP would be different.
>
> I can see an issue if those 5 tuples match and you have to find
> something else to distinguish them since Linux conntrack doesn't keep
> track of TCP sequence numbers and window dilations. If it did i dont see
> why this would be a problem. I think i am having a hard time visualizing
> when you would even need to kick in sequence number checks,
Not necessarily. You cannot (easily) decide which conntrack the reply
packets belong to. So, in the above scenario the following is perfectly
possible:
A -------- G\
--------S
B -------- H/
Let's suppose the SYN packets from A and B arrive at G and H
simultaneously, so that neither G nor H knows anything about the other
connection yet. When the NAT core searches for a suitable new source
address for the connection, it is possible that each node will choose
exactly the same new source IP:port pair. (Because they obviously do
uniqueness checks based only on their own conntrack state table.) And if
the two connections were destined to the same IP:port, the reply packets
for both connections will look exactly the same. In case of TCP you
could probably make guesses based on sequence numbers and such, but what
would you do in case of other protocols?
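The clash described above can be reproduced in a few lines of Python. This is a deliberately simplified model, not the netfilter NAT core: the function name and the "keep the original source port if it is free" rule are assumptions, and the addresses are documentation example values. The point is only that each gateway checks uniqueness against its own table.

```python
def pick_source_port(local_table, ext_ip, dst, preferred):
    """Pick a translated source port that is unique in THIS node's table only."""
    port = preferred
    while (dst, (ext_ip, port)) in local_table:
        port += 1
    local_table.add((dst, (ext_ip, port)))   # key = what a reply packet looks like
    return port


EXT_IP = "192.0.2.1"             # shared masquerade address (example value)
SERVER = ("198.51.100.7", 80)    # S,80 (example value)

table_g, table_h = set(), set()  # G and H have not exchanged any state yet
port_g = pick_source_port(table_g, EXT_IP, SERVER, 31285)   # SYN from A via G
port_h = pick_source_port(table_h, EXT_IP, SERVER, 31285)   # SYN from B via H
```

Both gateways end up with the same translated port, so replies from S to that address and port match an entry on G and on H, and nothing in the packet says which client they belong to. Had the two lookups shared one table, the second call would have moved on to the next port.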
The problem could be circumvented if we statically partitioned the
address space between the nodes in the cluster. Unfortunately this is
not so simple as it sounds, since it is possible to have untranslated
connections using the possibly clashing tuples as well... (Maybe we could
apply implicit SNAT translations in this case?)
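One possible reading of "statically partitioned" is to give each node a disjoint slice of the translated source-port range on the shared address; giving each node its own NAT address would be another. The Python sketch below shows the port-range variant only. The function name and the default port bounds are assumptions for illustration, and the sketch does not address the untranslated-connection problem raised above.

```python
def port_range_for(node_index, node_count, low=1024, high=65535):
    """Return the inclusive (start, end) slice of source ports owned by one node."""
    total = high - low + 1
    size = total // node_count
    start = low + node_index * size
    # The last node also takes whatever remains after integer division.
    end = high if node_index == node_count - 1 else start + size - 1
    return start, end
```

With two nodes the first one would allocate translated ports only from the lower half and the second only from the upper half, so their choices cannot collide even before any state has been synchronized.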
--
Regards,
Krisztian KOVACS
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
2004-09-28 12:24 ` KOVACS Krisztian
@ 2004-09-28 12:35 ` Henrik Nordstrom
2004-09-28 12:57 ` KOVACS Krisztian
0 siblings, 1 reply; 55+ messages in thread
From: Henrik Nordstrom @ 2004-09-28 12:35 UTC (permalink / raw)
To: KOVACS Krisztian
Cc: Harald Welte, netfilter-devel, hadi, Netfilter-failover list
On Tue, 28 Sep 2004, KOVACS Krisztian wrote:
> The problem could be circumvented if we statically partitioned the
> address space between the nodes in the cluster. Unfortunately this is
> not so simple as it sounds, since it is possible to have untranslated
> connections using the possibly clashing tuples as well... (Maybe we could
> apply implicit SNAT translations in this case?)
I think for the active-active case the only viable setup is to enforce
strict address separation, with the addresses used for NAT not used for
anything else, and unique per firewall in the active-active cluster.
This is not as bad as it sounds as the traffic needs to be partitioned as
well. We certainly do not want to see asymmetric flows in conntrack where
traffic goes out via one gateway and returns on another.
Regards
Henrik
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
2004-09-28 12:35 ` Henrik Nordstrom
@ 2004-09-28 12:57 ` KOVACS Krisztian
2004-09-28 13:14 ` jamal
2004-09-28 13:58 ` Henrik Nordstrom
0 siblings, 2 replies; 55+ messages in thread
From: KOVACS Krisztian @ 2004-09-28 12:57 UTC (permalink / raw)
To: Henrik Nordstrom
Cc: Harald Welte, netfilter-devel, hadi, Netfilter-failover list
Hi,
On Tue, 2004-09-28 at 14:35, Henrik Nordstrom wrote:
> > The problem could be circumvented if we statically partitioned the
> > address space between the nodes in the cluster. Unfortunately this is
> > not so simple as it sounds, since it is possible to have untranslated
> > connections using the possibly clashing tuples as well... (Maybe we could
> > apply implicit SNAT translations in this case?)
>
> I think for the active-active case the only viable setup is to enforce
> strict address separation, with the addresses used for NAT not used for
> anything else, and unique per firewall in the active-active cluster.
>
> This is not as bad as it sounds as the traffic needs to be partitioned as
> well. We certainly do not want to see asymmetric flows in conntrack where
> traffic goes out via one gateway and returns on another.
There are other solutions for that problem, for example Harald's
ClusterIP code. If we could integrate that with ct_sync we would be able
to do multi-master packet filter clusters without any load balancers
before the cluster. If the NAT core would be integrated with ClusterIP's
hash to avoid conntrack clashes we could do this without statically
assigning different NAT addresses to each node.
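A rough Python sketch of what "integrating the NAT core with the hash" might mean: every node applies the same hash to a flow to decide who is responsible for it, and when choosing a translated source port a node only accepts ports whose reply traffic hashes back to itself. Everything here is an assumption for illustration. In particular, `zlib.crc32` is a stand-in and not the hash ClusterIP actually uses, and the function names are invented.

```python
import zlib


def owner(tuple_key, node_count):
    """Index of the node responsible for a flow (stand-in hash, see note above)."""
    return zlib.crc32(repr(tuple_key).encode()) % node_count


def pick_port_for_node(node, node_count, ext_ip, dst, used, first=1024, last=65535):
    """Choose a source port whose reply traffic hashes back to this node."""
    for port in range(first, last + 1):
        key = (dst, (ext_ip, port))          # what the reply packet will look like
        if key not in used and owner(key, node_count) == node:
            used.add(key)
            return port
    raise RuntimeError("no free port maps to this node")
```

Since each candidate tuple hashes to exactly one node, two nodes can never pick the same translated tuple even when their tables are out of sync, and the replies arrive at the node that holds the conntrack entry.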
--
Regards,
Krisztian KOVACS
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
2004-09-28 12:57 ` KOVACS Krisztian
@ 2004-09-28 13:14 ` jamal
[not found] ` <1096379957.1026.5.camel@jzny.localdomain>
2004-09-28 13:58 ` Henrik Nordstrom
1 sibling, 1 reply; 55+ messages in thread
From: jamal @ 2004-09-28 13:14 UTC (permalink / raw)
To: KOVACS Krisztian
Cc: Harald Welte, netfilter-devel, Netfilter-failover list,
Henrik Nordstrom
BTW, thanks to both of you - I got what the challenge is now.
For some reason - I am still missing pieces of the
1) --> A sends update for IPx:portY
2) --> B updates its state with new pair.
3) --> B generates update for IPx:portz
4) --> A updates its state with new pair.
in #2 above B reserves that space and never uses it (same in #3 for A).
In other words, when B generates #3, it ensures no conflict by
definition.
You may need to synchronize and generate a "conflict detected" flag in
updates; i would suspect very little conflict though.
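The exchange in steps 1 to 4, including the "conflict detected" case, can be modelled with a short Python sketch. The class and method names are hypothetical, updates are delivered by direct method calls instead of over the network, and port-range exhaustion is ignored.

```python
class NatNode:
    def __init__(self):
        self.reserved = set()          # pairs we allocated plus pairs learned from peers

    def allocate(self, ip, preferred_port):
        """Pick a pair not known to be in use; the result is what gets sent as an update."""
        port = preferred_port
        while (ip, port) in self.reserved:
            port += 1
        self.reserved.add((ip, port))
        return (ip, port)

    def on_update(self, pair):
        """Apply a peer's update; return True if it clashes with a pair we already hold."""
        conflict = pair in self.reserved
        self.reserved.add(pair)
        return conflict
```

As long as an update arrives before the peer allocates, the peer skips the reserved pair and no conflict arises. If both nodes allocate before either update is delivered, `on_update` returns True, which is the situation where a conflict flag in the update would be needed.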
cheers,
jamal
On Tue, 2004-09-28 at 08:57, KOVACS Krisztian wrote:
> Hi,
>
> On Tue, 2004-09-28 at 14:35, Henrik Nordstrom wrote:
> > > The problem could be circumvented if we statically partitioned the
> > > address space between the nodes in the cluster. Unfortunately this is
> > > not so simple as it sounds, since it is possible to have untranslated
> > > connections using the possibly clashing tuples as well... (Maybe we could
> > > apply implicit SNAT translations in this case?)
> >
> > I think for the active-active case the only viable setup is to enforce
> > strict address separation, with the addresses used for NAT not used for
> > anything else, and unique per firewall in the active-active cluster.
> >
> > This is not as bad as it sounds as the traffic needs to be partitioned as
> > well. We certainly do not want to see asymmetric flows in conntrack where
> > traffic goes out via one gateway and returns on another.
>
> There are other solutions for that problem, for example Harald's
> ClusterIP code. If we could integrate that with ct_sync we would be able
> to do multi-master packet filter clusters without any load balancers
> before the cluster. If the NAT core would be integrated with ClusterIP's
> hash to avoid conntrack clashes we could do this without statically
> assigning different NAT addresses to each node.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
2004-09-28 12:57 ` KOVACS Krisztian
2004-09-28 13:14 ` jamal
@ 2004-09-28 13:58 ` Henrik Nordstrom
2004-09-28 14:24 ` Tobias DiPasquale
1 sibling, 1 reply; 55+ messages in thread
From: Henrik Nordstrom @ 2004-09-28 13:58 UTC (permalink / raw)
To: KOVACS Krisztian
Cc: Harald Welte, netfilter-devel, hadi, Netfilter-failover list
On Tue, 28 Sep 2004, KOVACS Krisztian wrote:
> There are other solutions for that problem, for example Harald's
> ClusterIP code. If we could integrate that with ct_sync we would be able
> to do multi-master packet filter clusters without any load balancers
> before the cluster. If the NAT core would be integrated with ClusterIP's
> hash to avoid conntrack clashes we could do this without statically
> assigning different NAT addresses to each node.
Any ideas on how this would work?
Let's reason around the common MASQUERADE case where an internal network
needs to be SNATed when going out to the Internet.
Regards
Henrik
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
2004-09-28 13:58 ` Henrik Nordstrom
@ 2004-09-28 14:24 ` Tobias DiPasquale
0 siblings, 0 replies; 55+ messages in thread
From: Tobias DiPasquale @ 2004-09-28 14:24 UTC (permalink / raw)
To: Henrik Nordstrom
Cc: Harald Welte, netfilter-devel, hadi, Netfilter-failover list,
KOVACS Krisztian
On Tue, 28 Sep 2004 15:58:52 +0200 (CEST), Henrik Nordstrom
<hno@marasystems.com> wrote:
> On Tue, 28 Sep 2004, KOVACS Krisztian wrote:
>
> > There are other solutions for that problem, for example Harald's
> > ClusterIP code. If we could integrate that with ct_sync we would be able
> > to do multi-master packet filter clusters without any load balancers
> > before the cluster. If the NAT core would be integrated with ClusterIP's
> > hash to avoid conntrack clashes we could do this without statically
> > assigning different NAT addresses to each node.
>
> Any ideas on how this would work?
>
> Let's reason around the common MASQUERADE case where an internal network
> needs to be SNATed when going out to the Internet.
Forgive me for bringing this back up, but...
I believe that Saru handles this problem by assigning "blocks" (a
block being a fixed-sized range of units, e.g. 512 source ports in
sequence) of IPs and ports to various nodes in the cluster and each
node only handles the IP/ports in its assigned blocks. The lookup is
just a bitop, so it's fast, and this would handle the MASQUERADE case
mentioned above nicely. The blocks are handed out by a userspace
daemon as nodes enter and leave the cluster.
Would this not work?
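A minimal sketch of such a block scheme, assuming 512-port blocks as in the example above and a per-node bitmask of owned blocks. The names and the round-robin hand-out are hypothetical stand-ins for whatever the userspace daemon would really do; this is not Saru's actual code.

```python
BLOCK_SHIFT = 9                     # 2**9 = 512 ports per block
N_BLOCKS = 65536 >> BLOCK_SHIFT     # 128 blocks cover the whole port space

def assign_blocks(n_nodes):
    """Deal blocks round-robin; stands in for the userspace daemon (assumption)."""
    masks = [0] * n_nodes
    for block in range(N_BLOCKS):
        masks[block % n_nodes] |= 1 << block
    return masks

def owns_port(mask, port):
    """Two shifts and one AND decide whether this node handles the port."""
    return bool((mask >> (port >> BLOCK_SHIFT)) & 1)
```

With three nodes, `assign_blocks(3)` gives node 0 the ports 0-511, node 1 the ports 512-1023, and so on; when a node leaves, the daemon would only need to redistribute that node's mask.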
--
[ Tobias DiPasquale ]
0x636f6465736c696e67657240676d61696c2e636f6d
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
2004-09-28 6:46 ` Henrik Nordstrom
2004-09-28 10:56 ` jamal
@ 2004-09-28 11:58 ` Tobias DiPasquale
2004-09-28 12:11 ` KOVACS Krisztian
2004-09-28 12:31 ` Henrik Nordstrom
1 sibling, 2 replies; 55+ messages in thread
From: Tobias DiPasquale @ 2004-09-28 11:58 UTC (permalink / raw)
To: Henrik Nordstrom; +Cc: nf-devel, netfilter-failover
On Tue, 28 Sep 2004 08:46:25 +0200 (CEST), Henrik Nordstrom
<hno@marasystems.com> wrote:
> No, this is about a different issue entirely.
>
> Let's assume you have two Active-Active gateways G and H, two clients A and
> B and one server S. On the gateway NAT is used to masquerade all traffic
> to a single external IP address.
>
> Due to the Active-Active setup traffic from A goes via the gateway G and
> traffic from B goes via H.
>
> Now you have a SYN from A,31285 to S,80 and also a SYN sent by B,31285 to
> S,80. You then end up with two identical NAT assignments and the two
> connections will conflict with each other.
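The clash described in the quoted scenario can be reproduced with a toy model. This sketch assumes each gateway keeps its own mapping table and keeps the client's source port when it looks free locally; the addresses and the `masquerade` helper are made up for illustration and do not mirror the kernel's NAT code.

```python
def masquerade(table, ext_ip, src, dst):
    """Map src to (ext_ip, port), keeping the source port unless this
    table already uses it for the same destination (assumed policy).
    Port wrap-around is ignored to keep the sketch short."""
    port = src[1]
    while (ext_ip, port, dst) in table:
        port += 1
    table[(ext_ip, port, dst)] = src
    return (ext_ip, port)

EXT = "192.0.2.1"           # shared external address (example value)
S = ("198.51.100.7", 80)    # server S (example value)
A = ("10.0.0.1", 31285)     # client A, routed via G
B = ("10.0.0.2", 31285)     # client B, routed via H

g_table, h_table = {}, {}   # G and H cannot see each other's mappings
via_g = masquerade(g_table, EXT, A, S)
via_h = masquerade(h_table, EXT, B, S)

shared = {}                 # what a single gateway (or instant sync) would see
first = masquerade(shared, EXT, A, S)
second = masquerade(shared, EXT, B, S)
```

With separate tables, `via_g` and `via_h` come out identical, so S sees two connections with the same tuple. With the shared table the second mapping is moved to another port, which is what a single node would do.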
Why use NAT at all for active-active? It's pretty slow in comparison to
the shared MAC/IP schema delineated at UltraMonkey.org:
http://www.ultramonkey.org/papers/active_active/active_active.shtml
Am I missing something? Is NAT required for some reason?
--
[ Tobias DiPasquale ]
0x636f6465736c696e67657240676d61696c2e636f6d
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
2004-09-28 11:58 ` Tobias DiPasquale
@ 2004-09-28 12:11 ` KOVACS Krisztian
2004-09-28 12:31 ` Henrik Nordstrom
1 sibling, 0 replies; 55+ messages in thread
From: KOVACS Krisztian @ 2004-09-28 12:11 UTC (permalink / raw)
To: Tobias DiPasquale; +Cc: nf-devel, Netfilter-failover list, Henrik Nordstrom
Hi,
On Tue, 2004-09-28 at 13:58, Tobias DiPasquale wrote:
> On Tue, 28 Sep 2004 08:46:25 +0200 (CEST), Henrik Nordstrom
> <hno@marasystems.com> wrote:
> > No, this is about a different issue entirely.
> >
> > Let's assume you have two Active-Active gateways G and H, two clients A and
> > B and one server S. On the gateway NAT is used to masquerade all traffic
> > to a single external IP address.
> >
> > Due to the Active-Active setup traffic from A goes via the gateway G and
> > traffic from B goes via H.
> >
> > Now you have a SYN from A,31285 to S,80 and also a SYN sent by B,31285 to
> > S,80. You then end up with two identical NAT assignments and the two
> > connections will conflict with each other.
>
> Why use NAT at all for active-active? It's pretty slow in comparison to
> the shared MAC/IP schema delineated at UltraMonkey.org:
>
> http://www.ultramonkey.org/papers/active_active/active_active.shtml
>
> Am I missing something? Is NAT required for some reason?
Ok, but this is not redirector failover, ct_sync is simply a general
purpose conntrack state replication solution. And as such, it should be
able to handle NAT-related conntrack data as well. If you have a
multi-master (load balancing) packet filter cluster it still has to be
able to do anything you can do with a single node.
--
Regards,
Krisztian KOVACS
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [nf-failover] Re: [RFC] ct_sync 0.15 (corrected)
2004-09-28 11:58 ` Tobias DiPasquale
2004-09-28 12:11 ` KOVACS Krisztian
@ 2004-09-28 12:31 ` Henrik Nordstrom
1 sibling, 0 replies; 55+ messages in thread
From: Henrik Nordstrom @ 2004-09-28 12:31 UTC (permalink / raw)
To: Tobias DiPasquale; +Cc: nf-devel, netfilter-failover
On Tue, 28 Sep 2004, Tobias DiPasquale wrote:
> Why use NAT at all for active-active? It's pretty slow in comparison to
> the shared MAC/IP schema delineated at UltraMonkey.org:
We are talking firewalls here, not loadbalancers.
NAT of the traffic forwarded, not NAT to reach the box.
> Am I missing something? Is NAT required for some reason?
Yes, it is required by the policy the firewall implements.
Regards
Henrik
^ permalink raw reply [flat|nested] 55+ messages in thread