* failing fail-over - commit still in progress
From: Pierre-Philipp Braun @ 2023-08-11 8:55 UTC (permalink / raw)
To: netfilter
Hello
I have a fairly standard active/passive NAT setup with keepalived + conntrackd, on three nodes. I am trying to validate fail-over for inbound traffic: an open SSH connection, initiated from the outside, that reaches a system behind the NAT through a DNAT rule.
Here, for example, is the connection as seen from within the SSH session:
tcp 0 52 10.1.0.50:22 178.205.50.68:27531 ESTABLISHED 288/sshd: root@pts/
I can see the state from the active node:
# internal cache
tcp 6 ESTABLISHED src=178.205.50.68 dst=217.19.208.157 sport=27531 dport=50 src=10.1.0.50 dst=178.205.50.68 sport=22 dport=27531 [ASSURED] [active since 237s]
it's absent on node2, as we are in active/passive mode.
# external cache
tcp 6 ESTABLISHED src=178.205.50.68 dst=10.1.0.50 sport=27531 dport=22 [ASSURED] [active since 403s]
I can also see it on node3, in its internal cache, although I did not disable external caches:
# internal cache
tcp 6 ESTABLISHED src=178.205.50.68 dst=10.1.0.50 sport=27531 dport=22 src=10.1.0.50 dst=178.205.50.68 sport=22 dport=27531 [ASSURED] [active since 217s]
# external cache
(not there)
Why? Because node1, node2 and node3 are Xen virtual machine monitors that also host guests, in addition to serving NAT for them.
So here we go, this is what happens when I kill keepalived on the active node (currently node1).
node2 shows:
[Fri Aug 11 11:41:59 2023] (pid=14642) [notice] committing all external caches
[Fri Aug 11 11:41:59 2023] (pid=14642) [notice] Committed 71 new entries
[Fri Aug 11 11:41:59 2023] (pid=14642) [notice] commit has taken 0.000558 seconds
[Fri Aug 11 11:41:59 2023] (pid=14642) [notice] flushing conntrack table in 60 secs
[Fri Aug 11 11:41:59 2023] (pid=14642) [ERROR] ignoring flush command, commit still in progress
[Fri Aug 11 11:41:59 2023] (pid=14642) [notice] resync requested
[Fri Aug 11 11:41:59 2023] (pid=14642) [notice] resync with master conntrack table
[Fri Aug 11 11:41:59 2023] (pid=14642) [notice] sending bulk update
[Fri Aug 11 11:42:59 2023] (pid=14642) [notice] flushing kernel conntrack table (scheduled)
and node3 shows:
[Fri Aug 11 11:41:59 2023] (pid=25228) [notice] committing all external caches
[Fri Aug 11 11:41:59 2023] (pid=25228) [notice] Committed 3 new entries
[Fri Aug 11 11:41:59 2023] (pid=25228) [notice] commit has taken 0.000069 seconds
[Fri Aug 11 11:41:59 2023] (pid=25228) [ERROR] ignoring flush command, commit still in progress
[Fri Aug 11 11:41:59 2023] (pid=25228) [notice] resync with master conntrack table
[Fri Aug 11 11:41:59 2023] (pid=25228) [notice] resync requested by other node
[Fri Aug 11 11:41:59 2023] (pid=25228) [notice] sending bulk update
[Fri Aug 11 11:41:59 2023] (pid=25228) [notice] sending bulk update
[Fri Aug 11 11:42:00 2023] (pid=25228) [notice] resync requested by other node
[Fri Aug 11 11:42:00 2023] (pid=25228) [notice] sending bulk update
[Fri Aug 11 11:42:01 2023] (pid=25228) [notice] resync requested by other node
[Fri Aug 11 11:42:01 2023] (pid=25228) [notice] sending bulk update
[Fri Aug 11 11:42:02 2023] (pid=25228) [notice] resync requested by other node
[Fri Aug 11 11:42:02 2023] (pid=25228) [notice] sending bulk update
[Fri Aug 11 11:42:03 2023] (pid=25228) [notice] resync requested by other node
[Fri Aug 11 11:42:03 2023] (pid=25228) [notice] sending bulk update
[Fri Aug 11 11:42:04 2023] (pid=25228) [notice] resync requested by other node
[Fri Aug 11 11:42:04 2023] (pid=25228) [notice] sending bulk update
[Fri Aug 11 11:42:05 2023] (pid=25228) [notice] resync requested by other node
[Fri Aug 11 11:42:05 2023] (pid=25228) [notice] sending bulk update
[Fri Aug 11 11:42:06 2023] (pid=25228) [notice] resync requested by other node
[Fri Aug 11 11:42:06 2023] (pid=25228) [notice] sending bulk update
...
When I try to commit manually, it does not complain that another commit is in progress.
But since the commit (-c) only returns once it finishes, I guess this means either that conflicting commits are going on (I don't see where, as keepalived only calls the primary script once, on the new active node), or that something in my network setup, possibly the discrepancy noted above (state already known on the backup), is causing a conflict.
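For reference, the stock primary-backup script shipped with conntrack-tools runs roughly this sequence in the primary case (just a sketch of the example script; my primary-backup.bash may differ in details):

conntrackd -c   # commit the external cache into the kernel table
conntrackd -f   # flush the internal and external caches
conntrackd -R   # resync the internal cache with the kernel table
conntrackd -B   # send a bulk update to the other nodes

which seems to match the log lines above: commit, then the flush being ignored because the commit is still flagged as in progress.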
versions:
Linux 5.16.20
nftables v1.0.1 (Fearless Fosdick #3)
Keepalived v2.2.8
Connection tracking userspace daemon v1.4.7 (GIT master branch)
nftables.conf:
define nic=xenbr0
define gst=guestbr0
table inet filter
flush table inet filter
table inet filter {
chain input {
type filter hook input priority filter; policy accept;
ip protocol icmp accept
ip6 nexthdr ipv6-icmp accept
#ip protocol vrrp ip daddr 224.0.0.0/8 accept
ip protocol vrrp accept
#iif $nic tcp dport 1-3000 accept
#iif $nic tcp dport 64999 accept
# conntrackd wants drop
#iif $nic ct state established,related accept
#iif $nic drop
#iif $gst ct state established,related accept
#iif $gst drop
}
# NAT --> accept
chain forward {
type filter hook forward priority filter; policy accept;
}
chain output {
type filter hook output priority filter; policy accept;
ip protocol icmp accept
ip6 nexthdr ipv6-icmp accept
#ip protocol vrrp ip saddr 224.0.0.0/8 accept
ip protocol vrrp accept
# conntrack wants drop
#oif $gst ct state established,related accept
#oif $gst drop
}
}
table ip nat
flush table ip nat
table ip nat {
chain postrouting {
type nat hook postrouting priority srcnat;
ip saddr 10.1.0.0/16 oif $nic snat 217.19.208.154;
#ip saddr 10.1.0.0/16 oif $nic snat 217.19.208.157;
}
chain prerouting {
type nat hook prerouting priority dstnat;
...
iif $nic tcp dport 50 dnat 10.1.0.50:22;
...
}
}
keepalived.conf:
global_defs {
max_auto_priority -1
notification_email {
support@angrycow.ru
}
notification_email_from support@angrycow.ru
checker_log_all_failures
default_interface xenbr0
# need root for conntrackd
#enable_script_security
#script_user keepalive keepalive
}
vrrp_sync_group nat {
group {
front-vip
guest-vip
}
# active/passive
notify_master "/etc/conntrackd/primary-backup.bash primary"
notify_backup "/etc/conntrackd/primary-backup.bash backup"
notify_fault "/etc/conntrackd/primary-backup.bash fault"
# active/active
#notify "/var/tmp/notify.bash"
}
vrrp_instance front-vip {
state BACKUP
interface xenbr0
virtual_router_id 1
priority 1
advert_int 1
virtual_ipaddress {
217.19.208.157/29
}
# default route remains anyhow
notify "/var/tmp/notify.bash"
}
vrrp_instance guest-vip {
state BACKUP
interface guestbr0
virtual_router_id 2
priority 1
advert_int 1
virtual_ipaddress {
10.1.255.254/16
}
notify "/var/tmp/notify.bash"
}
==> same on all nodes, letting vrrp do its own election...
conntrackd.conf:
Sync {
Mode FTFW {
# casual fail-over - active/passive
DisableExternalCache off
# active/active
#DisableExternalCache on
# grab states from the past
StartupResync on
}
UDP {
IPv4_address 10.3.3.1
IPv4_Destination_Address 10.3.3.2
IPv4_Destination_Address 10.3.3.3
Port 3780
Interface br0
SndSocketBuffer 1249280
RcvSocketBuffer 1249280
Checksum on
}
}
General {
Systemd off
HashSize 8192
# 2 x /proc/sys/net/netfilter/nf_conntrack_max
HashLimit 131072
LogFile on
Syslog off
LockFile /var/lock/conntrack.lock
NetlinkBufferSize 2097152
NetlinkBufferSizeMaxGrowth 8388608
UNIX {
Path /var/run/conntrackd.ctl
}
Filter {
Protocol Accept {
TCP
#SCTP
#UDP
#ICMP
}
Address Ignore {
IPv4_address 127.0.0.1
IPv6_address ::1
# don't track cluster/storage network
IPv4_address 10.3.3.0/24
}
State Accept {
ESTABLISHED CLOSED TIME_WAIT CLOSE_WAIT for TCP
}
}
}
It has been hard to troubleshoot and I don't see what is wrong in my setup. Please advise.
BR
-elge
* Re: failing fail-over - commit still in progress
From: Pierre-Philipp Braun @ 2023-08-11 8:58 UTC (permalink / raw)
To: netfilter
> nftables.conf:
==> same on all nodes except the outbound external IP
I also tried using the external VRRP VIP itself, but that does not change
anything, since we are looking at an inbound-traffic fail-over issue.
> keepalived.conf:
> ...
> ==> same on all nodes, letting vrrp do its own election...
> conntrackd.conf:
==> same on all nodes except the UDP IPv4_address bind and destination addresses.
* Re: failing fail-over - commit still in progress
From: Pablo Neira Ayuso @ 2023-08-11 10:53 UTC (permalink / raw)
To: Pierre-Philipp Braun; +Cc: netfilter
On Fri, Aug 11, 2023 at 11:55:42AM +0300, Pierre-Philipp Braun wrote:
> Hello
>
> I have a casual NAT active/passive setup with keepalived+conntrackd,
> on three nodes.
Three nodes and FT-FW mode will not work. FT-FW would need to be
extended to maintain sequence tracking for more than one single node.
It is doable but this requires development effort.
For three nodes, you should try NOTRACK, which means sync messages are
sent from the active to the passive nodes without any kind of sequence
tracking (best-effort approach).
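Something like this in conntrackd.conf (an untested sketch; keep your existing UDP block and General section as they are):

Sync {
Mode NOTRACK {
DisableExternalCache off
StartupResync on
}
# your existing UDP { ... } block goes here unchanged
}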
[...]
> versions:
>
> Linux 5.16.20
BTW, why this kernel version? This is not any of the -stable kernels.
> nftables v1.0.1 (Fearless Fosdick #3)
> Keepalived v2.2.8
> Connection tracking userspace daemon v1.4.7 (GIT master branch)
>
> nftables.conf:
>
> define nic=xenbr0
> define gst=guestbr0
>
> table inet filter
> flush table inet filter
> table inet filter {
> chain input {
> type filter hook input priority filter; policy accept;
>
> ip protocol icmp accept
> ip6 nexthdr ipv6-icmp accept
> #ip protocol vrrp ip daddr 224.0.0.0/8 accept
> ip protocol vrrp accept
meta l4proto { icmp, ipv6-icmp, vrrp } accept
BTW, you could merge these rules with a set, to have a less iptables-ish
ruleset.
With newer nftables versions, I recommend running the -o/--optimize
option to check for ruleset optimizations.
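For example, to get such suggestions as a dry run without touching the live ruleset (the file path is just an example):

nft -c -o -f /etc/nftables.conf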
* Re: failing fail-over - commit still in progress
From: Pierre-Philipp Braun @ 2023-08-12 9:52 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: netfilter
> Three nodes and FT-FW mode will not work. FT-FW would need to be
> extended to maintain sequence tracking for more than one single node.
> It is doable but this requires development effort.
>
> For three node, you should try NOTRACK which means sync messages are
> sent from active to passive nodes without any kind of sequence
> tracking (best effort approach).
I switched to NOTRACK UDP but I get the same issue with the commit.
The inbound session is seen all right on all the nodes, although node3 (active VRRP) sees it in both the internal and external caches.
The host where the guest lives sees it only in the internal cache this time.
(vrrp backup - here lives the guest)
pmr1: conntrack v1.4.7 (conntrack-tools): 203 flow entries have been shown.
pmr1: tcp 6 431976 ESTABLISHED src=176.59.99.113 dst=10.1.0.50 sport=57995 dport=22 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] mark=0 use=1
pmr1: internal cache
pmr1: tcp 6 ESTABLISHED src=176.59.99.113 dst=10.1.0.50 sport=57995 dport=22 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] [active since 60s]
pmr1: external cache
(vrrp backup)
pmr2: conntrack v1.4.7 (conntrack-tools): 139 flow entries have been shown.
pmr2: internal cache
pmr2: external cache
pmr2: tcp 6 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 [ASSURED] [active since 60s]
(active vrrp)
pmr3: tcp 6 431976 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] mark=0 use=1
pmr3: conntrack v1.4.7 (conntrack-tools): 140 flow entries have been shown.
pmr3: internal cache
pmr3: tcp 6 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED]
[active since 60s]
pmr3: external cache
pmr3: tcp 6 ESTABLISHED src=176.59.99.113 dst=10.1.0.50 sport=57995 dport=22 [ASSURED] [active since 60s]
Going for the acceptance test: node2 became active and that's a success (for once) - I didn't lose the SSH connection to the guest system.
These are the states after fail-over to node2.
(backup vrrp - guest lives there)
pmr1: conntrack v1.4.7 (conntrack-tools): 198 flow entries have been shown.
pmr1: tcp 6 431992 ESTABLISHED src=176.59.99.113 dst=10.1.0.50 sport=57995 dport=22 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] mark=0 use=1
pmr1: internal cache
pmr1: tcp 6 ESTABLISHED src=176.59.99.113 dst=10.1.0.50 sport=57995 dport=22 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] [active since 187s]
pmr1: external cache
(active vrrp)
pmr2: conntrack v1.4.7 (conntrack-tools): 148 flow entries have been shown.
pmr2: tcp 6 431992 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] mark=0 use=1
pmr2: internal cache
pmr2: tcp 6 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED]
mark=0 [active since 257s]
pmr2: external cache
pmr2: tcp 6 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 [ASSURED] [active since 636s]
(backup vrrp)
pmr3: tcp 6 431692 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] mark=0 use=1
pmr3: conntrack v1.4.7 (conntrack-tools): 124 flow entries have been shown.
pmr3: internal cache
pmr3: tcp 6 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED]
[active since 636s]
pmr3: external cache
pmr3: tcp 6 ESTABLISHED src=176.59.99.113 dst=10.1.0.50 sport=57995 dport=22 [ASSURED] [active since 187s]
Let's do it again! Here we go: node1 became master and I lost the connection.
node1 shows
[Sat Aug 12 12:43:03 2023] (pid=24942) [notice] committing all external caches
[Sat Aug 12 12:43:03 2023] (pid=24942) [notice] Committed 0 new entries
[Sat Aug 12 12:43:03 2023] (pid=24942) [notice] commit has taken 0.000017 seconds
[Sat Aug 12 12:43:03 2023] (pid=24942) [ERROR] ignoring flush command, commit still in progress
[Sat Aug 12 12:43:03 2023] (pid=24942) [notice] resync with master conntrack table
[Sat Aug 12 12:43:03 2023] (pid=24942) [notice] sending bulk update
node2 shows
[Sat Aug 12 12:37:51 2023] (pid=20216) [notice] committing all external caches
[Sat Aug 12 12:37:51 2023] (pid=20216) [ERROR] commit-create: File exists
Sat Aug 12 12:37:51 2023 tcp 6 60 SYN_RECV src=8.222.205.118 dst=10.1.0.11 sport=39230 dport=22
[Sat Aug 12 12:37:51 2023] (pid=20216) [notice] Committed 38 new entries
[Sat Aug 12 12:37:51 2023] (pid=20216) [notice] 1 entries can't be committed
[Sat Aug 12 12:37:51 2023] (pid=20216) [notice] commit has taken 0.000560 seconds
[Sat Aug 12 12:37:51 2023] (pid=20216) [ERROR] ignoring flush command, commit still in progress
[Sat Aug 12 12:37:51 2023] (pid=20216) [notice] resync with master conntrack table
[Sat Aug 12 12:37:51 2023] (pid=20216) [notice] sending bulk update
node3 shows
[Sat Aug 12 12:37:51 2023] (pid=21424) [notice] resync requested by other node
[Sat Aug 12 12:37:51 2023] (pid=21424) [notice] sending bulk update
States referring to the previously used port (57995/tcp) are as follows:
(active vrrp and guest lives there)
pmr1: conntrack v1.4.7 (conntrack-tools): 198 flow entries have been shown.
pmr1: tcp 6 431958 ESTABLISHED src=176.59.99.113 dst=10.1.0.50 sport=57995 dport=22 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] mark=0 use=1
pmr1: internal cache
pmr1: tcp 6 ESTABLISHED src=176.59.99.113 dst=10.1.0.50 sport=57995 dport=22 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] mark=0 [active since 279s]
pmr1: external cache
(backup vrrp)
pmr2: conntrack v1.4.7 (conntrack-tools): 136 flow entries have been shown.
pmr2: tcp 6 431958 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] mark=0 use=1
pmr2: internal cache
pmr2: tcp 6 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED]
mark=0 [active since 349s]
pmr2: external cache
pmr2: tcp 6 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 [ASSURED] [active since 728s]
(backup vrrp)
pmr3: tcp 6 431600 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] mark=0 use=1
pmr3: conntrack v1.4.7 (conntrack-tools): 133 flow entries have been shown.
pmr3: internal cache
pmr3: tcp 6 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED]
[active since 728s]
pmr3: external cache
pmr3: tcp 6 ESTABLISHED src=176.59.99.113 dst=10.1.0.50 sport=57995 dport=22 [ASSURED] mark=0 [active since 279s]
I think it's not necessarily because of the three-node setup (I tried with two nodes and, AFAIR, got the same commit issue).
>> Linux 5.16.20
>
> BTW, why this kernel version? This is not any of the -stable kernels.
Because the latest REISER4 *1 patch is for 5.16. I downgraded to Linux longterm 5.15 for the purpose of these tests, though, to avoid having anything too exotic.
The cluster farm is currently running linux 5.15.126 + drbd v9.2.5 module.
*1 https://lab.nethence.com/fsbench/2022-10.html
> BTW, you could merge these rules with a set, to have a less iptabl-ish
> ruleset.
>
> With newer nftables versions, I recommend running the -o/--optimize
> option to check for ruleset optimizations.
yup, thanks for the tip
In a further thread I will describe the issues I have when I switch to active/active mode by disabling external caches.
I get different symptoms in that scenario.
-elge
* Re: failing fail-over - commit still in progress
From: Pablo Neira Ayuso @ 2023-08-12 21:08 UTC (permalink / raw)
To: Pierre-Philipp Braun; +Cc: netfilter
On Sat, Aug 12, 2023 at 12:52:19PM +0300, Pierre-Philipp Braun wrote:
>
> > Three nodes and FT-FW mode will not work. FT-FW would need to be
> > extended to maintain sequence tracking for more than one single node.
> > It is doable but this requires development effort.
> >
> > For three node, you should try NOTRACK which means sync messages are
> > sent from active to passive nodes without any kind of sequence
> > tracking (best effort approach).
>
> I switched to NOTRACK UDP but I get the same issue with the commit.
>
> The inbound session is seen alright on all the nodes, although node3 (active vrrp) sees it both in internal and external cache.
> The host where the guest lives sees it only in the internal cache this time.
You should see:
- active: internal cache contains the flow that represents the SSH
connection.
- backup: external cache contains the flow that represents the SSH
connection.
On failover, what you see in the external cache on the backup node
becomes visible in the internal cache.
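You can check this on each node with the cache dump options, for example:

conntrackd -i   # dump the internal cache
conntrackd -e   # dump the external cache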
By "inbound session", I guess you refer to the SSH connection you use
for testing, but is this a SSH connection to the guest VM? Is this
DNAT to the guest VM or simply routing?
Does such a guest VM get migrated to the active node, with the active
node then forwarding traffic to the guest VM?
From what you write, there is no state synchronization issue with
NOTRACK with three nodes.
If the connection gets lost on failover, it might also be related to
your firewall policy. If the state is not yet in conntrack, NAT
packets will be handled as local packets by the router, not by the guest
itself, and will likely be rejected with a TCP RST.
Dropping packets that are in invalid state is important to make sure
no races occur with state injection; note that your base chain policy is
also set to accept by default.
Please also check that you set:
/proc/sys/net/netfilter/nf_conntrack_tcp_loose
to zero to disable TCP connection tracking pick up on failover.
Otherwise, conntrack creates an entry from the middle.
Moreover, you will need to drop packets in invalid state in your
policy in combination with this sysctl toggle, both at input and
forward chains.
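A minimal sketch of that combination, reusing the table and chain names from your ruleset above (adapt as needed):

echo 0 > /proc/sys/net/netfilter/nf_conntrack_tcp_loose

nft add rule inet filter input ct state invalid drop
nft add rule inet filter forward ct state invalid drop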
* Re: failing fail-over - commit still in progress
From: Pierre-Philipp Braun @ 2023-08-21 6:19 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: netfilter
> - active: internal cache contains the flow that represents the SSH
> connection.
> - backup: external cache contains the flow that represents the SSH
> connection.
I started a new PoC from scratch, with two simple Debian nodes and
only three interfaces, which finally made it possible to use the drop policy.
Before, I could see the states being synced to the internal/external
caches. As far as I remember, I could see the states in the previous PoC
even though it was only doing NAT, without any filtering. Now it's even
worse: the backup node doesn't even see the states in its external
cache (both with FTFW/UDP and NOTRACK/UDP).
Are tracking rules in the filter table absolutely mandatory to make the
states known to conntrackd? I ask because conntrack -L can see
the local states without anything specific.
If so, do tracking rules written with nftables also work, or do I
have to use iptables instead?
If so, on which chains must I absolutely have a drop policy
(input / forward / output)?
Is there an MWE with nftables rules somewhere that I could test?
> By "inbound session", I guess you refer to the SSH connection you use
> for testing, but is this a SSH connection to the guest VM? Is this
> DNAT to the guest VM or simply routing?
Yes, I was talking about a connection from the outside to a guest system
behind DNAT. The same goes for the new PoC; it's just that the VRRP nodes
are now guest systems themselves. To simplify the PoC (and have far
fewer network interfaces, no bonding, no bridges, no VLANs), I've made the
gateways guests themselves, and they now have only three interfaces:
eth0 -- front-facing
eth1 -- internal network
eth2 -- cluster network for the sync
so I could afford to use a drop policy without too much headache.
> Such guest VM gets migrated to the active node and the active node
> forwards traffic to the guest VM?
No, my experiments so far have nothing to do with guest migrations. I
was only testing that the SSH connection survives a VRRP fail-over.
There are other interesting use-cases, specific to virtualization, where
we could consider using conntrack-tools, but that's a whole
different story. And given the problems I face with the most KISS setup,
even with this MWE new PoC, I guess I would use something else for that
purpose (NetBSD PF+PFSYNC+CARP just works out of the box and is
active/active by default, which is what I need for that other use-case).
> /proc/sys/net/netfilter/nf_conntrack_tcp_loose
OK, that helps not to lose the SSH connection immediately, but still,
with the newer simple PoC I cannot even see the states replicated.
I also noticed this setting, is that required?
net.netfilter.nf_conntrack_helper = 0
It would be nice to have a fully working MWE tutorial available, to be
able to test the simplest active/passive setup. I will be glad to
document mine, if I finally manage to get it working.
* Re: failing fail-over - commit still in progress
From: Pablo Neira Ayuso @ 2023-08-21 9:26 UTC (permalink / raw)
To: Pierre-Philipp Braun; +Cc: netfilter
On Mon, Aug 21, 2023 at 09:19:21AM +0300, Pierre-Philipp Braun wrote:
> > - active: internal cache contains the flow that represents the SSH
> > connection.
> > - backup: external cache contains the flow that represents the SSH
> > connection.
>
> I started from scratch a new PoC with two simple debian nodes and with only
> three interfaces, which eventually let me do the drop policy.
>
> Before, I could see the states being synced on the internal/external cache.
> As far as I remember, I could see the states in the previous PoC even if it
> was only doing NAT, without any filtering. Now it's even worse. The backup
> node doesn't even see the states in its external cache (both with FTFW/UDP
> and NOTRACK/UDP).
>
> Are tracking rules in the filter table absolutely mandatory to make the
> states known to conntrackd? I ask that because, conntrack -L can see the
> local states without anything specific.
As I said before, you have to have a stateful ruleset which does not
pick up states from the middle.
> If so, does tracking rules initiated with nftables also work, or do I have
> to use iptables instead?
nftables is completely irrelevant in this picture. State
synchronization relies on ctnetlink and the userspace conntrackd daemon;
nftables is only the packet classification framework.
> If so, on which chains should I have absolutely have a drop policy (input /
> forward / output)?
>
> Is there a MWE with nftables rules somewhere that I could test?
>
> > By "inbound session", I guess you refer to the SSH connection you use
> > for testing, but is this a SSH connection to the guest VM? Is this
> > DNAT to the guest VM or simply routing?
>
> Yes, I was talking about a connection from the outside to a guest system
> behind DNAT. Same goes for the new PoC, it's just that the VRRP nodes are
> now guest systems themselves. To simplify the PoC (and have way less
> network interfaces, no bonding, no bridges, no vlans), I've put the gateways
> as guest and they now have only three interfaces.
>
> eth0 -- front-facing
> eth1 -- internal network
> eth2 -- cluster network for the sync
>
> so I could afford using a drop policy without too much headache.
Rule of thumb: you have to disable nf_conntrack_tcp_loose in
conntrack and use a stateful ruleset which drops packets that are in
invalid state.
Otherwise, state synchronization does not make sense, because conntrack
can pick up connections from the middle, i.e. you can implement "poor
man's" failover and let conntrack recover the history from the middle.
> Ok, that helps not to loose SSH the connection immediately, but still, with
> the newer simple PoC I cannot even see the states replicated.
Can you see events on the active node with `conntrack -E`?
Did you debug with tcpdump on both ends to check whether conntrackd
delivers the synchronization messages?
What do the conntrackd stats tell you? There is a good number of options
that allow you to debug your setup.
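For instance (the port is from the conntrackd.conf you posted earlier and the interface from your new PoC layout; adjust if yours differ):

conntrack -E                     # flow events on the active node
tcpdump -ni eth2 udp port 3780   # sync traffic on the dedicated link
conntrackd -s                    # cache and traffic counters
conntrackd -s network            # per-node network/error counters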
> I also noticed this setting, is that required?
>
> net.netfilter.nf_conntrack_helper = 0
How are conntrack helpers related to the issue you describe?
> It would be nice to have a fully working MWE tutorial available, to be able
> to test the simplest active/passive setup. I will be glad to document mine,
> if I finally manage to get it working.
Documentation is available here:
http://conntrack-tools.netfilter.org/manual.html
* Re: failing fail-over - commit still in progress
From: Pierre-Philipp Braun @ 2023-08-24 9:59 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: netfilter
On 8/21/23 12:26, Pablo Neira Ayuso wrote:
> As I said before, you have to have a stateful ruleset which does not
> pick up states from the middle.
I am now filtering both interfaces, the front-facing and the internal one, on the FORWARD chain.
table inet filter {
chain input {
type filter hook input priority filter; policy accept;
}
chain forward {
type filter hook forward priority filter; policy drop;
ip protocol icmp accept
ct state invalid log prefix "INVALID: " drop
ct state established,related,new accept
log prefix "DROP POLICY: "
}
chain output {
type filter hook output priority filter; policy accept;
}
}
table ip nat {
chain postrouting {
type nat hook postrouting priority srcnat; policy accept;
ip saddr 10.1.0.0/16 oif "eth0" snat to 217.19.208.157
}
chain prerouting {
type nat hook prerouting priority dstnat; policy accept;
iif "eth0" tcp dport 50 dnat to 10.1.0.50:22
}
}
However, it looks like conntrack is still picking up states from the middle.
> nftables is completely irrelevant in this picture. State
> synchronization relies on ctnetlink and userspace conntrackd for state
> synchronization. nftables is only the packet classification framework.
I was just wondering if I absolutely had to use the iptables example from the testcase sample.
I notice that example has an additional SYN-flag match.
Basically I am doing things the other way around: DNAT instead of SNAT.
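For the record, I suppose the nft equivalent of that SYN restriction would be something like this (an untested sketch, not taken from the testcase):

ct state established,related accept
ct state invalid drop
tcp flags & (fin|syn|rst|ack) == syn ct state new accept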
> Rule of thumb is: You have disable nf_conntrack_tcp_loose from
> conntrack and a stateful ruleset which drops packets that are in
> invalid state.
tcp_loose is zero and as for the stateful ruleset, I am still not sure.
> Otherwise, state synchronization does not make sense because conntrack
> can pick connections from the middle, ie. you can implement "poor man"
> failover and let conntrack recover the history from the middle.
This seems to be what is happening (correct me if I am wrong). I sometimes manage to fail over successfully after carefully checking that things are in order, namely that the state is present in the internal and external caches respectively. In any case, the commit error remains.
> Can you see events on the active node with `conntrack -E`?
No, and I have to restart conntrackd to actually be able to see the states (I guess thanks to `StartupResync on`).
> Did you debug with tcpdump on both ends to check to see if conntrackd
> delivers the synchronization messages?
Yes, I noticed a UDP datagram that is larger than the other ones.
> What do conntrackd stats tell you? There is a good number of options
> that allow you debug your setup.
Right after a successful fail-over:
root@vrrp1:~# conntrackd -s
cache internal:
current active connections: 2
connections created: 4 failed: 0
connections updated: 0 failed: 0
connections destroyed: 2 failed: 0
cache external:
current active connections: 2
connections created: 4 failed: 0
connections updated: 0 failed: 0
connections destroyed: 2 failed: 0
traffic processed:
0 Bytes 0 Pckts
UDP traffic (active device=eth2):
8384 Bytes sent 8216 Bytes recv
499 Pckts sent 498 Pckts recv
0 Error send 0 Error recv
message tracking:
0 Malformed msgs 0 Lost msgs
root@vrrp2:~# conntrackd -s
cache internal:
current active connections: 2
connections created: 2 failed: 0
connections updated: 0 failed: 0
connections destroyed: 0 failed: 0
cache external:
current active connections: 2
connections created: 6 failed: 0
connections updated: 0 failed: 0
connections destroyed: 4 failed: 0
traffic processed:
0 Bytes 0 Pckts
UDP traffic (active device=eth2):
8552 Bytes sent 8704 Bytes recv
519 Pckts sent 519 Pckts recv
0 Error send 0 Error recv
message tracking:
0 Malformed msgs 0 Lost msgs
But the commit error is always there in the logs.
* Re: failing fail-over - commit still in progress
From: Pablo Neira Ayuso @ 2023-08-28 8:02 UTC (permalink / raw)
To: Pierre-Philipp Braun; +Cc: netfilter
On Thu, Aug 24, 2023 at 12:59:47PM +0300, Pierre-Philipp Braun wrote:
> On 8/21/23 12:26, Pablo Neira Ayuso wrote:
> > As I said before, you have to have a stateful ruleset which does not
> > pick up states from the middle.
>
> I am now filtering both interfaces, the front-facing and the internal one, on the FORWARD chain.
>
> table inet filter {
> chain input {
> type filter hook input priority filter; policy accept;
> }
>
> chain forward {
> type filter hook forward priority filter; policy drop;
> ip protocol icmp accept
> ct state invalid log prefix "INVALID: " drop
> ct state established,related,new accept
> log prefix "DROP POLICY: "
> }
>
> chain output {
> type filter hook output priority filter; policy accept;
> }
> }
> table ip nat {
> chain postrouting {
> type nat hook postrouting priority srcnat; policy accept;
> ip saddr 10.1.0.0/16 oif "eth0" snat to 217.19.208.157
> }
>
> chain prerouting {
> type nat hook prerouting priority dstnat; policy accept;
> iif "eth0" tcp dport 50 dnat to 10.1.0.50:22
> }
> }
>
> However it looks like I am still tracking in the middle.
>
> > nftables is completely irrelevant in this picture. State
> > synchronization relies on ctnetlink and userspace conntrackd for state
> > synchronization. nftables is only the packet classification framework.
>
> I was just wondering if I absolutely had to use the iptables example from the testcase sample.
> I notice that example has additional SYN flag.
> Basically I am doing things the other way around, DNAT instead of SNAT.
>
> > Rule of thumb is: You have disable nf_conntrack_tcp_loose from
> > conntrack and a stateful ruleset which drops packets that are in
> > invalid state.
>
> tcp_loose is zero and as for the stateful ruleset, I am still not sure.
>
> > Otherwise, state synchronization does not make sense because conntrack
> > can pick connections from the middle, ie. you can implement "poor man"
> > failover and let conntrack recover the history from the middle.
>
> This seems to be what is happening (correct me if I am wrong). I sometimes manage to successfully fail-over after checking carefully that things are in order, namely that the state is there on internal and external cache respectively. Anyhow the commit error remains.
>
> > Can you see events on the active node with `conntrack -E`?
>
> No, and I have to restart conntrackd to actually be able to see the states (I guess thanks to `StartupResync on`).
Did you enable CONFIG_NF_CONNTRACK_EVENTS in your kernel?
CONFIG_NF_CONNTRACK_EVENTS=y
`conntrack -E' should show events, regardless of your conntrackd
configuration, when you create new flows.
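A quick way to check (paths vary by distribution; this assumes the kernel config is exposed via /proc or /boot):

zgrep NF_CONNTRACK_EVENTS /proc/config.gz 2>/dev/null || grep NF_CONNTRACK_EVENTS /boot/config-$(uname -r)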
> > Did you debug with tcpdump on both ends to check to see if conntrackd
> > delivers the synchronization messages?
>
> Yes I noticed an UDP datagram that is larger than the other ones.
You should see UDP traffic on port 3780, unless you have changed the
Port in your conntrackd.conf configuration file.
> > What do conntrackd stats tell you? There is a good number of options
> > that allow you debug your setup.
>
> right after a successful fail-over
>
> root@vrrp1:~# conntrackd -s
> cache internal:
> current active connections: 2
> connections created: 4 failed: 0
> connections updated: 0 failed: 0
> connections destroyed: 2 failed: 0
>
> cache external:
> current active connections: 2
> connections created: 4 failed: 0
> connections updated: 0 failed: 0
> connections destroyed: 2 failed: 0
>
> traffic processed:
> 0 Bytes 0 Pckts
>
> UDP traffic (active device=eth2):
> 8384 Bytes sent 8216 Bytes recv
> 499 Pckts sent 498 Pckts recv
> 0 Error send 0 Error recv
>
> message tracking:
> 0 Malformed msgs 0 Lost msgs
>
> root@vrrp2:~# conntrackd -s
> cache internal:
> current active connections: 2
> connections created: 2 failed: 0
> connections updated: 0 failed: 0
> connections destroyed: 0 failed: 0
>
> cache external:
> current active connections: 2
> connections created: 6 failed: 0
> connections updated: 0 failed: 0
> connections destroyed: 4 failed: 0
>
> traffic processed:
> 0 Bytes 0 Pckts
>
> UDP traffic (active device=eth2):
> 8552 Bytes sent 8704 Bytes recv
> 519 Pckts sent 519 Pckts recv
> 0 Error send 0 Error recv
>
> message tracking:
> 0 Malformed msgs 0 Lost msgs
>
> but the commit error is always there in the logs.