* failing fail-over - commit still in progress
@ 2023-08-11  8:55 Pierre-Philipp Braun
  2023-08-11  8:58 ` Pierre-Philipp Braun
  2023-08-11 10:53 ` Pablo Neira Ayuso
  0 siblings, 2 replies; 10+ messages in thread
From: Pierre-Philipp Braun @ 2023-08-11  8:55 UTC (permalink / raw)
  To: netfilter

Hello

I have a casual NAT active/passive setup with keepalived+conntrackd, on three nodes.  I am trying to validate a fail-over on inbound traffic: an open SSH connection, initiated from the outside, taking advantage of a DNAT rule that points to a system behind the NAT.

Here for example from within the ssh session:

tcp        0     52 10.1.0.50:22            178.205.50.68:27531     ESTABLISHED 288/sshd: root@pts/

I can see the state from the active node:

	# internal cache
tcp      6 ESTABLISHED src=178.205.50.68 dst=217.19.208.157 sport=27531 dport=50 src=10.1.0.50 dst=178.205.50.68 sport=22 dport=27531 [ASSURED] [active since 237s]

it's absent from node2's internal cache, as we are in active/passive mode, but present in its external cache:

	# external cache
	tcp      6 ESTABLISHED src=178.205.50.68 dst=10.1.0.50 sport=27531 dport=22 [ASSURED] [active since 403s]

I can also see it on node3, although I did not disable external caches:

	# internal cache
tcp      6 ESTABLISHED src=178.205.50.68 dst=10.1.0.50 sport=27531 dport=22 src=10.1.0.50 dst=178.205.50.68 sport=22 dport=27531 [ASSURED] [active since 217s]

	# external cache
	(not there)

Why?  Because node1, node2 and node3 are Xen virtual machine monitors that actually host guests, aside from serving NAT for them.

So here we go, this is what happens when I kill keepalived on the active node (currently node1).
node2 shows:

[Fri Aug 11 11:41:59 2023] (pid=14642) [notice] committing all external caches
[Fri Aug 11 11:41:59 2023] (pid=14642) [notice] Committed 71 new entries
[Fri Aug 11 11:41:59 2023] (pid=14642) [notice] commit has taken 0.000558 seconds
[Fri Aug 11 11:41:59 2023] (pid=14642) [notice] flushing conntrack table in 60 secs
[Fri Aug 11 11:41:59 2023] (pid=14642) [ERROR] ignoring flush command, commit still in progress
[Fri Aug 11 11:41:59 2023] (pid=14642) [notice] resync requested
[Fri Aug 11 11:41:59 2023] (pid=14642) [notice] resync with master conntrack table
[Fri Aug 11 11:41:59 2023] (pid=14642) [notice] sending bulk update
[Fri Aug 11 11:42:59 2023] (pid=14642) [notice] flushing kernel conntrack table (scheduled)

and node3 shows:

[Fri Aug 11 11:41:59 2023] (pid=25228) [notice] committing all external caches
[Fri Aug 11 11:41:59 2023] (pid=25228) [notice] Committed 3 new entries
[Fri Aug 11 11:41:59 2023] (pid=25228) [notice] commit has taken 0.000069 seconds
[Fri Aug 11 11:41:59 2023] (pid=25228) [ERROR] ignoring flush command, commit still in progress
[Fri Aug 11 11:41:59 2023] (pid=25228) [notice] resync with master conntrack table
[Fri Aug 11 11:41:59 2023] (pid=25228) [notice] resync requested by other node
[Fri Aug 11 11:41:59 2023] (pid=25228) [notice] sending bulk update
[Fri Aug 11 11:41:59 2023] (pid=25228) [notice] sending bulk update
[Fri Aug 11 11:42:00 2023] (pid=25228) [notice] resync requested by other node
[Fri Aug 11 11:42:00 2023] (pid=25228) [notice] sending bulk update
[Fri Aug 11 11:42:01 2023] (pid=25228) [notice] resync requested by other node
[Fri Aug 11 11:42:01 2023] (pid=25228) [notice] sending bulk update
[Fri Aug 11 11:42:02 2023] (pid=25228) [notice] resync requested by other node
[Fri Aug 11 11:42:02 2023] (pid=25228) [notice] sending bulk update
[Fri Aug 11 11:42:03 2023] (pid=25228) [notice] resync requested by other node
[Fri Aug 11 11:42:03 2023] (pid=25228) [notice] sending bulk update
[Fri Aug 11 11:42:04 2023] (pid=25228) [notice] resync requested by other node
[Fri Aug 11 11:42:04 2023] (pid=25228) [notice] sending bulk update
[Fri Aug 11 11:42:05 2023] (pid=25228) [notice] resync requested by other node
[Fri Aug 11 11:42:05 2023] (pid=25228) [notice] sending bulk update
[Fri Aug 11 11:42:06 2023] (pid=25228) [notice] resync requested by other node
[Fri Aug 11 11:42:06 2023] (pid=25228) [notice] sending bulk update
...

When I try to commit manually, it doesn't say another commit is in progress.
But since -c returns once it finishes, I guess that means there are either some conflicting commits going on (I don't see where, as keepalived only calls the primary script once on the new active node)
--or-- something related to my network setup (possibly the discrepancy noticed above, the known state on the backup) creates a conflict.

versions:

Linux 5.16.20
nftables v1.0.1 (Fearless Fosdick #3)
Keepalived v2.2.8
Connection tracking userspace daemon v1.4.7 (GIT master branch)

nftables.conf:

define nic=xenbr0
define gst=guestbr0

table inet filter
flush table inet filter
table inet filter {
         chain input {
                 type filter hook input priority filter; policy accept;

                 ip protocol icmp accept
                 ip6 nexthdr ipv6-icmp accept
                 #ip protocol vrrp ip daddr 224.0.0.0/8 accept
                 ip protocol vrrp accept

                 #iif $nic tcp dport 1-3000 accept
                 #iif $nic tcp dport 64999 accept

                 # conntrackd wants drop
                 #iif $nic ct state established,related accept
                 #iif $nic drop

                 #iif $gst ct state established,related accept
                 #iif $gst drop
         }

         # NAT --> accept
         chain forward {
                 type filter hook forward priority filter; policy accept;
         }

         chain output {
                 type filter hook output priority filter; policy accept;

                 ip protocol icmp accept
                 ip6 nexthdr ipv6-icmp accept
                 #ip protocol vrrp ip saddr 224.0.0.0/8 accept
                 ip protocol vrrp accept

                 # conntrack wants drop
                 #oif $gst ct state established,related accept
                 #oif $gst drop
         }
}

table ip nat
flush table ip nat
table ip nat {
         chain postrouting {
                 type nat hook postrouting priority srcnat;
                 ip saddr 10.1.0.0/16 oif $nic snat 217.19.208.154;
                 #ip saddr 10.1.0.0/16 oif $nic snat 217.19.208.157;
         }

         chain prerouting {
                 type nat hook prerouting priority dstnat;

		...
                 iif $nic tcp dport 50 dnat 10.1.0.50:22;
		...
         }
}

keepalived.conf:

global_defs {
         max_auto_priority -1

         notification_email {
                 support@angrycow.ru
         }

         notification_email_from support@angrycow.ru
         checker_log_all_failures
         default_interface xenbr0

         # need root for conntrackd
         #enable_script_security
         #script_user keepalive keepalive
}

vrrp_sync_group nat {
         group {
                 front-vip
                 guest-vip
         }

         # active/passive
         notify_master   "/etc/conntrackd/primary-backup.bash primary"
         notify_backup   "/etc/conntrackd/primary-backup.bash backup"
         notify_fault    "/etc/conntrackd/primary-backup.bash fault"

         # active/active
         #notify "/var/tmp/notify.bash"
}

vrrp_instance front-vip {
         state BACKUP
         interface xenbr0
         virtual_router_id 1
         priority 1
         advert_int 1

         virtual_ipaddress {
                 217.19.208.157/29
         }
         # default route remains anyhow

         notify "/var/tmp/notify.bash"
}

vrrp_instance guest-vip {
         state BACKUP
         interface guestbr0
         virtual_router_id 2
         priority 1
         advert_int 1

         virtual_ipaddress {
                 10.1.255.254/16
         }

         notify "/var/tmp/notify.bash"
}

==> same on all nodes, letting vrrp do its own election...

conntrackd.conf:

Sync {
         Mode FTFW {
                 # casual fail-over - active/passive
                 DisableExternalCache off

                 # active/active
                 #DisableExternalCache on

                 # grab states from the past
                 StartupResync on
         }

         UDP {
                 IPv4_address 10.3.3.1
                 IPv4_Destination_Address 10.3.3.2
                 IPv4_Destination_Address 10.3.3.3
                 Port 3780
                 Interface br0
                 SndSocketBuffer 1249280
                 RcvSocketBuffer 1249280
                 Checksum on
         }
}

General {
         Systemd off
         HashSize 8192
         # 2 x /proc/sys/net/netfilter/nf_conntrack_max
         HashLimit 131072
         LogFile on
         Syslog off
         LockFile /var/lock/conntrack.lock

         NetlinkBufferSize 2097152
         NetlinkBufferSizeMaxGrowth 8388608


         UNIX {
                 Path /var/run/conntrackd.ctl
         }

         Filter {
                 Protocol Accept {
                         TCP
                         #SCTP
                         #UDP
                         #ICMP
                 }

                 Address Ignore {
                         IPv4_address 127.0.0.1
                         IPv6_address ::1

                         # don't track cluster/storage network
                         IPv4_address 10.3.3.0/24
                 }

                 State Accept {
                         ESTABLISHED CLOSED TIME_WAIT CLOSE_WAIT for TCP
                 }
         }
}

It's been hard to troubleshoot; I don't see what's wrong in my setup.  Please advise.

BR
-elge

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: failing fail-over - commit still in progress
  2023-08-11  8:55 failing fail-over - commit still in progress Pierre-Philipp Braun
@ 2023-08-11  8:58 ` Pierre-Philipp Braun
  2023-08-11 10:53 ` Pablo Neira Ayuso
  1 sibling, 0 replies; 10+ messages in thread
From: Pierre-Philipp Braun @ 2023-08-11  8:58 UTC (permalink / raw)
  To: netfilter


> nftables.conf:

==> same on all nodes but the outbound external IP
I also tried with the external vrrp vip itself, but that doesn't change
anything, as we are considering an inbound-traffic fail-over issue.

> keepalived.conf:
> ...
> ==> same on all nodes, letting vrrp do its own election...

> conntrackd.conf:

==> same on all nodes but the UDP IPv4_address bind and destinations.


* Re: failing fail-over - commit still in progress
  2023-08-11  8:55 failing fail-over - commit still in progress Pierre-Philipp Braun
  2023-08-11  8:58 ` Pierre-Philipp Braun
@ 2023-08-11 10:53 ` Pablo Neira Ayuso
  2023-08-12  9:52   ` Pierre-Philipp Braun
  1 sibling, 1 reply; 10+ messages in thread
From: Pablo Neira Ayuso @ 2023-08-11 10:53 UTC (permalink / raw)
  To: Pierre-Philipp Braun; +Cc: netfilter

On Fri, Aug 11, 2023 at 11:55:42AM +0300, Pierre-Philipp Braun wrote:
> Hello
> 
> I have a casual NAT active/passive setup with keepalived+conntrackd,
> on three nodes.

Three nodes and FT-FW mode will not work. FT-FW would need to be
extended to maintain sequence tracking for more than a single node.
It is doable, but this requires development effort.

For three nodes, you should try NOTRACK, which means sync messages are
sent from the active to the passive nodes without any kind of sequence
tracking (a best-effort approach).
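
Something like this in conntrackd.conf, as a minimal sketch (keep your
existing UDP section unchanged; I am only swapping the Mode keyword and
keeping the cache/resync options you already use):

	Sync {
		Mode NOTRACK {
			DisableExternalCache off
			StartupResync on
		}

		UDP {
			# same addresses, port and interface as in your current setup
		}
	}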

[...]
> versions:
> 
> Linux 5.16.20

BTW, why this kernel version? This is not any of the -stable kernels.

> nftables v1.0.1 (Fearless Fosdick #3)
> Keepalived v2.2.8
> Connection tracking userspace daemon v1.4.7 (GIT master branch)
> 
> nftables.conf:
> 
> define nic=xenbr0
> define gst=guestbr0
> 
> table inet filter
> flush table inet filter
> table inet filter {
>         chain input {
>                 type filter hook input priority filter; policy accept;
> 
>                 ip protocol icmp accept
>                 ip6 nexthdr ipv6-icmp accept
>                 #ip protocol vrrp ip daddr 224.0.0.0/8 accept
>                 ip protocol vrrp accept

                  meta l4proto { icmp, ipv6-icmp, vrrp } accept

BTW, you could merge these rules with a set, to have a less iptables-ish
ruleset.

With newer nftables versions, I recommend running the -o/--optimize
option to check for ruleset optimizations.

* Re: failing fail-over - commit still in progress
  2023-08-11 10:53 ` Pablo Neira Ayuso
@ 2023-08-12  9:52   ` Pierre-Philipp Braun
  2023-08-12 21:08     ` Pablo Neira Ayuso
  0 siblings, 1 reply; 10+ messages in thread
From: Pierre-Philipp Braun @ 2023-08-12  9:52 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter


> Three nodes and FT-FW mode will not work. FT-FW would need to be
> extended to maintain sequence tracking for more than a single node.
> It is doable, but this requires development effort.
> 
> For three nodes, you should try NOTRACK, which means sync messages are
> sent from the active to the passive nodes without any kind of sequence
> tracking (a best-effort approach).

I switched to NOTRACK UDP but I get the same issue with the commit.

The inbound session is seen all right on all the nodes, although node3 (the active vrrp node) sees it in both the internal and external caches.
The host where the guest lives sees it only in the internal cache this time.

(vrrp backup - here lives the guest)
pmr1: conntrack v1.4.7 (conntrack-tools): 203 flow entries have been shown.
pmr1: tcp      6 431976 ESTABLISHED src=176.59.99.113 dst=10.1.0.50 sport=57995 dport=22 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] mark=0 use=1
pmr1: internal cache
pmr1: tcp      6 ESTABLISHED src=176.59.99.113 dst=10.1.0.50 sport=57995 dport=22 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] [active since 60s]
pmr1: external cache

(vrrp backup)
pmr2: conntrack v1.4.7 (conntrack-tools): 139 flow entries have been shown.
pmr2: internal cache
pmr2: external cache
pmr2: tcp      6 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 [ASSURED] [active since 60s]

(active vrrp)
pmr3: tcp      6 431976 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] mark=0 use=1
pmr3: conntrack v1.4.7 (conntrack-tools): 140 flow entries have been shown.
pmr3: internal cache
pmr3: tcp      6 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED]
[active since 60s]
pmr3: external cache
pmr3: tcp      6 ESTABLISHED src=176.59.99.113 dst=10.1.0.50 sport=57995 dport=22 [ASSURED] [active since 60s]

Going for the acceptance test: node2 became active, and that's a success (for once) - I didn't lose the SSH connection to the guest system.
These are the states after fail-over to node2.

(backup vrrp - guest lives there)
pmr1: conntrack v1.4.7 (conntrack-tools): 198 flow entries have been shown.
pmr1: tcp      6 431992 ESTABLISHED src=176.59.99.113 dst=10.1.0.50 sport=57995 dport=22 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] mark=0 use=1
pmr1: internal cache
pmr1: tcp      6 ESTABLISHED src=176.59.99.113 dst=10.1.0.50 sport=57995 dport=22 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] [active since 187s]
pmr1: external cache

(active vrrp)
pmr2: conntrack v1.4.7 (conntrack-tools): 148 flow entries have been shown.
pmr2: tcp      6 431992 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] mark=0 use=1
pmr2: internal cache
pmr2: tcp      6 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED]
mark=0 [active since 257s]
pmr2: external cache
pmr2: tcp      6 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 [ASSURED] [active since 636s]

(backup vrrp)
pmr3: tcp      6 431692 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] mark=0 use=1
pmr3: conntrack v1.4.7 (conntrack-tools): 124 flow entries have been shown.
pmr3: internal cache
pmr3: tcp      6 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED]
[active since 636s]
pmr3: external cache
pmr3: tcp      6 ESTABLISHED src=176.59.99.113 dst=10.1.0.50 sport=57995 dport=22 [ASSURED] [active since 187s]

Let us do it again!  Here we go: node1 became master and I lost the connection.

node1 shows

[Sat Aug 12 12:43:03 2023] (pid=24942) [notice] committing all external caches
[Sat Aug 12 12:43:03 2023] (pid=24942) [notice] Committed 0 new entries
[Sat Aug 12 12:43:03 2023] (pid=24942) [notice] commit has taken 0.000017 seconds
[Sat Aug 12 12:43:03 2023] (pid=24942) [ERROR] ignoring flush command, commit still in progress
[Sat Aug 12 12:43:03 2023] (pid=24942) [notice] resync with master conntrack table
[Sat Aug 12 12:43:03 2023] (pid=24942) [notice] sending bulk update

node2 shows

[Sat Aug 12 12:37:51 2023] (pid=20216) [notice] committing all external caches
[Sat Aug 12 12:37:51 2023] (pid=20216) [ERROR] commit-create: File exists
Sat Aug 12 12:37:51 2023        tcp      6 60 SYN_RECV src=8.222.205.118 dst=10.1.0.11 sport=39230 dport=22
[Sat Aug 12 12:37:51 2023] (pid=20216) [notice] Committed 38 new entries
[Sat Aug 12 12:37:51 2023] (pid=20216) [notice] 1 entries can't be committed
[Sat Aug 12 12:37:51 2023] (pid=20216) [notice] commit has taken 0.000560 seconds
[Sat Aug 12 12:37:51 2023] (pid=20216) [ERROR] ignoring flush command, commit still in progress
[Sat Aug 12 12:37:51 2023] (pid=20216) [notice] resync with master conntrack table
[Sat Aug 12 12:37:51 2023] (pid=20216) [notice] sending bulk update

node3 shows

[Sat Aug 12 12:37:51 2023] (pid=21424) [notice] resync requested by other node
[Sat Aug 12 12:37:51 2023] (pid=21424) [notice] sending bulk update

states referring to that previously used port (57995/tcp) are as follows

(active vrrp and guest lives there)
pmr1: conntrack v1.4.7 (conntrack-tools): 198 flow entries have been shown.
pmr1: tcp      6 431958 ESTABLISHED src=176.59.99.113 dst=10.1.0.50 sport=57995 dport=22 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] mark=0 use=1
pmr1: internal cache
pmr1: tcp      6 ESTABLISHED src=176.59.99.113 dst=10.1.0.50 sport=57995 dport=22 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] mark=0 [active since 279s]
pmr1: external cache

(backup vrrp)
pmr2: conntrack v1.4.7 (conntrack-tools): 136 flow entries have been shown.
pmr2: tcp      6 431958 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] mark=0 use=1
pmr2: internal cache
pmr2: tcp      6 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED]
mark=0 [active since 349s]
pmr2: external cache
pmr2: tcp      6 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 [ASSURED] [active since 728s]

(backup vrrp)
pmr3: tcp      6 431600 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED] mark=0 use=1
pmr3: conntrack v1.4.7 (conntrack-tools): 133 flow entries have been shown.
pmr3: internal cache
pmr3: tcp      6 ESTABLISHED src=176.59.99.113 dst=217.19.208.157 sport=57995 dport=50 src=10.1.0.50 dst=176.59.99.113 sport=22 dport=57995 [ASSURED]
[active since 728s]
pmr3: external cache
pmr3: tcp      6 ESTABLISHED src=176.59.99.113 dst=10.1.0.50 sport=57995 dport=22 [ASSURED] mark=0 [active since 279s]

I think it's not necessarily because of the three-node setup (I tried with two nodes and, AFAIR, I got the same commit issue).

>> Linux 5.16.20
> 
> BTW, why this kernel version? This is not any of the -stable kernels.

Because the latest REISER4 *1 patch is for 5.16.  I downgraded to Linux longterm 5.15 for the purpose of these tests though, to avoid having anything too exotic.
The cluster farm is currently running linux 5.15.126 + drbd v9.2.5 module.

*1 https://lab.nethence.com/fsbench/2022-10.html

> BTW, you could merge these rules with a set, to have a less iptabl-ish
> ruleset.
> 
> With newer nftables version, I recommend to run -o/--optimization
> option to check for ruleset optimizations.

yup, thanks for the tip

In a further thread I will describe the issues I have when I switch to active/active mode by disabling external caches.
I've got other symptoms in that scenario.

-elge

* Re: failing fail-over - commit still in progress
  2023-08-12  9:52   ` Pierre-Philipp Braun
@ 2023-08-12 21:08     ` Pablo Neira Ayuso
  2023-08-21  6:19       ` Pierre-Philipp Braun
  0 siblings, 1 reply; 10+ messages in thread
From: Pablo Neira Ayuso @ 2023-08-12 21:08 UTC (permalink / raw)
  To: Pierre-Philipp Braun; +Cc: netfilter

On Sat, Aug 12, 2023 at 12:52:19PM +0300, Pierre-Philipp Braun wrote:
> 
> > Three nodes and FT-FW mode will not work. FT-FW would need to be
> > extended to maintain sequence tracking for more than a single node.
> > It is doable, but this requires development effort.
> > 
> > For three nodes, you should try NOTRACK, which means sync messages are
> > sent from the active to the passive nodes without any kind of sequence
> > tracking (a best-effort approach).
> 
> I switched to NOTRACK UDP but I get the same issue with the commit.
> 
> The inbound session is seen alright on all the nodes, although node3 (active vrrp) sees it both in internal and external cache.
> The host where the guest lives sees it only in the internal cache this time.

You should see:

- active: internal cache contains the flow that represents the SSH
  connection.
- backup: external cache contains the flow that represents the SSH
  connection.

On failover, what you see in the external cache on the backup node
becomes visible in the internal cache.

By "inbound session", I guess you refer to the SSH connection you use
for testing, but is this a SSH connection to the guest VM? Is this
DNAT to the guest VM or simply routing?

Does such a guest VM get migrated to the active node, and does the
active node then forward traffic to the guest VM?

From what you write, there is no state synchronization issue with
NOTRACK with three nodes.

If the connection gets lost on failover, it might also be related to
your firewall policy. If the state is not yet in conntrack, NAT
packets will be handled as local packets by the router, not the guest
itself, which will likely reject them with a TCP RST.

Dropping packets that are in an invalid state is important to make
sure no races occur with state injection; note that your base chain
policy is also set to accept by default.

Please also check that you set:

/proc/sys/net/netfilter/nf_conntrack_tcp_loose

to zero to disable TCP connection tracking pick-up on failover.
Otherwise, conntrack creates an entry from the middle of the stream.

Moreover, you will need to drop packets in an invalid state in your
policy, in combination with this sysctl toggle, at both the input and
forward chains.
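
As a rough sketch, assuming your existing chain layout and with
nf_conntrack_tcp_loose already set to 0, that means something like:

	chain input {
		type filter hook input priority filter; policy drop;
		ct state invalid drop
		ct state established,related accept
		# plus your accept rules for new connections, VRRP, ICMP, ...
	}

	chain forward {
		type filter hook forward priority filter; policy drop;
		ct state invalid drop
		ct state established,related accept
		# plus your accept rules for new forwarded connections
	}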

* Re: failing fail-over - commit still in progress
  2023-08-12 21:08     ` Pablo Neira Ayuso
@ 2023-08-21  6:19       ` Pierre-Philipp Braun
  2023-08-21  9:26         ` Pablo Neira Ayuso
  0 siblings, 1 reply; 10+ messages in thread
From: Pierre-Philipp Braun @ 2023-08-21  6:19 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter

> - active: internal cache contains the flow that represents the SSH
>    connection.
> - backup: external cache contains the flow that represents the SSH
>    connection.

I started a new PoC from scratch, with two simple Debian nodes and with
only three interfaces, which eventually let me apply the drop policy.

Before, I could see the states being synced on the internal/external 
cache.  As far as I remember, I could see the states in the previous PoC 
even if it was only doing NAT, without any filtering.  Now it's even 
worse.  The backup node doesn't even see the states in its external 
cache (both with FTFW/UDP and NOTRACK/UDP).

Are tracking rules in the filter table absolutely mandatory to make the
states known to conntrackd?  I ask because conntrack -L can see the
local states without anything specific.

If so, do tracking rules set up with nftables also work, or do I
have to use iptables instead?

If so, on which chains should I absolutely have a drop policy
(input / forward / output)?

Is there a MWE with nftables rules somewhere that I could test?

> By "inbound session", I guess you refer to the SSH connection you use
> for testing, but is this a SSH connection to the guest VM? Is this
> DNAT to the guest VM or simply routing?

Yes, I was talking about a connection from the outside to a guest system
behind DNAT.  Same goes for the new PoC; it's just that the VRRP nodes
are now guest systems themselves.  To simplify the PoC (and have far
fewer network interfaces: no bonding, no bridges, no vlans), I've made
the gateways guests, and they now have only three interfaces.

eth0 -- front-facing
eth1 -- internal network
eth2 -- cluster network for the sync

so I could afford using a drop policy without too much headache.

> Such guess VM gets migrated to the active node and the active node
> forwards traffic to the guest VM?

No, my experiments so far have nothing to do with guest migrations.  I 
was only testing that the SSH connection remains during VRRP fail-over.

There are other interesting use-cases where we could consider using 
conntrack-tools which are specific to virtualization, but that's a whole 
different story.  And given the problems I face with the most KISS setup 
even with this MWE new PoC, I guess I would use something else for that 
purpose (NetBSD PF+PFSYNC+CARP just works out-of-the-box and is 
active/active by default, which is what I need for that other use-case).

> /proc/sys/net/netfilter/nf_conntrack_tcp_loose

OK, that helps not to lose the SSH connection immediately, but still,
with the newer simple PoC I cannot even see the states replicated.

I also noticed this setting, is that required?

net.netfilter.nf_conntrack_helper = 0

It would be nice to have a fully working MWE tutorial available, to be 
able to test the simplest active/passive setup.  I will be glad to 
document mine, if I finally manage to get it working.

* Re: failing fail-over - commit still in progress
  2023-08-21  6:19       ` Pierre-Philipp Braun
@ 2023-08-21  9:26         ` Pablo Neira Ayuso
  2023-08-24  9:59           ` Pierre-Philipp Braun
  0 siblings, 1 reply; 10+ messages in thread
From: Pablo Neira Ayuso @ 2023-08-21  9:26 UTC (permalink / raw)
  To: Pierre-Philipp Braun; +Cc: netfilter

On Mon, Aug 21, 2023 at 09:19:21AM +0300, Pierre-Philipp Braun wrote:
> > - active: internal cache contains the flow that represents the SSH
> >    connection.
> > - backup: external cache contains the flow that represents the SSH
> >    connection.
> 
> I started from scratch a new PoC with two simple debian nodes and with only
> three interfaces, which eventually let me do the drop policy.
> 
> Before, I could see the states being synced on the internal/external cache.
> As far as I remember, I could see the states in the previous PoC even if it
> was only doing NAT, without any filtering.  Now it's even worse.  The backup
> node doesn't even see the states in its external cache (both with FTFW/UDP
> and NOTRACK/UDP).
> 
> Are tracking rules in the filter table absolutely mandatory to make the
> states known to conntrackd?  I ask that because, conntrack -L can see the
> local states without anything specific.

As I said before, you have to have a stateful ruleset which does not
pick up states from the middle.

> If so, does tracking rules initiated with nftables also work, or do I have
> to use iptables instead?

nftables is completely irrelevant in this picture. State
synchronization relies on ctnetlink and the userspace conntrackd
daemon; nftables is only the packet classification framework.

> If so, on which chains should I have absolutely have a drop policy (input /
> forward / output)?
> 
> Is there a MWE with nftables rules somewhere that I could test?
>
> > By "inbound session", I guess you refer to the SSH connection you use
> > for testing, but is this a SSH connection to the guest VM? Is this
> > DNAT to the guest VM or simply routing?
> 
> Yes, I was talking about a connection from the outside to a guest system
> behind DNAT.  Same goes for the new PoC, it's just that the VRRP nodes are
> now guest systems themselves.  To simplify the PoC (and have way less
> network interfaces, no bonding, no bridges, no vlans), I've put the gateways
> as guest and they now have only three interfaces.
> 
> eth0 -- front-facing
> eth1 -- internal network
> eth2 -- cluster network for the sync
> 
> so I could afford using a drop policy without too much headache.

Rule of thumb: you have to disable nf_conntrack_tcp_loose for
conntrack and use a stateful ruleset which drops packets that are in
an invalid state.

Otherwise, state synchronization does not make sense, because conntrack
can pick up connections from the middle, i.e. you could implement a
"poor man's" failover and let conntrack recover the history mid-stream.

> Ok, that helps not to loose SSH the connection immediately, but still, with
> the newer simple PoC I cannot even see the states replicated.

Can you see events on the active node with `conntrack -E`?

Did you debug with tcpdump on both ends to check whether conntrackd
delivers the synchronization messages?

What do the conntrackd stats tell you? There is a good number of options
that allow you to debug your setup.
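
For example (the interface and port here are taken from your earlier
configs; adjust them to your sync link):

	# state events as they are generated, on the active node
	conntrack -E -p tcp

	# check that sync messages actually leave/arrive on the dedicated link
	tcpdump -ni eth2 udp port 3780

	# per-subsystem counters
	conntrackd -s network
	conntrackd -s cache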

> I also noticed this setting, is that required?
> 
> net.netfilter.nf_conntrack_helper = 0

How are conntrack helpers related to the issue you describe?

> It would be nice to have a fully working MWE tutorial available, to be able
> to test the simplest active/passive setup.  I will be glad to document mine,
> if I finally manage to get it working.

Documentation is available here:

http://conntrack-tools.netfilter.org/manual.html

* Re: failing fail-over - commit still in progress
  2023-08-21  9:26         ` Pablo Neira Ayuso
@ 2023-08-24  9:59           ` Pierre-Philipp Braun
  2023-08-28  8:02             ` Pablo Neira Ayuso
  0 siblings, 1 reply; 10+ messages in thread
From: Pierre-Philipp Braun @ 2023-08-24  9:59 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter

On 8/21/23 12:26, Pablo Neira Ayuso wrote:
> As I said before, you have to have a stateful ruleset which does not
> pick up states from the middle.

I am now filtering both interfaces, the front-facing and the internal one, on the FORWARD chain.

table inet filter {
         chain input {
                 type filter hook input priority filter; policy accept;
         }

         chain forward {
                 type filter hook forward priority filter; policy drop;
                 ip protocol icmp accept
                 ct state invalid log prefix "INVALID: " drop
                 ct state established,related,new accept
                 log prefix "DROP POLICY: "
         }

         chain output {
                 type filter hook output priority filter; policy accept;
         }
}
table ip nat {
         chain postrouting {
                 type nat hook postrouting priority srcnat; policy accept;
                 ip saddr 10.1.0.0/16 oif "eth0" snat to 217.19.208.157
         }

         chain prerouting {
                 type nat hook prerouting priority dstnat; policy accept;
                 iif "eth0" tcp dport 50 dnat to 10.1.0.50:22
         }
}

However, it looks like I am still picking up states from the middle.

> nftables is completely irrelevant in this picture. State
> synchronization relies on ctnetlink and userspace conntrackd for state
> synchronization. nftables is only the packet classification framework.

I was just wondering whether I absolutely had to use the iptables example from the testcase sample.
I notice that example has an additional SYN-flag check.
Basically I am doing things the other way around: DNAT instead of SNAT.
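
For reference, a sketch of what that SYN check could look like in nft syntax (untested; adapted to the forward chain above, and note that new non-TCP flows would then need their own accept rules):

```
chain forward {
        type filter hook forward priority filter; policy drop;
        ct state invalid log prefix "INVALID: " drop
        ct state established,related accept
        # only let new TCP flows begin with an initial SYN
        tcp flags & (syn|ack) == syn ct state new accept
        ip protocol icmp accept
        log prefix "DROP POLICY: "
}
```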

> Rule of thumb: you have to disable nf_conntrack_tcp_loose in
> conntrack and use a stateful ruleset which drops packets that are in
> an invalid state.

tcp_loose is zero; as for the stateful ruleset, I am still not sure.
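
For reference, the knob can be pinned persistently via sysctl; a sketch (the sysctl.d file name is an assumption):

```
# /etc/sysctl.d/90-conntrack.conf (assumed path)
# keep conntrack from picking up established TCP flows mid-stream
net.netfilter.nf_conntrack_tcp_loose = 0
```

Apply with `sysctl --system`, or `sysctl -w net.netfilter.nf_conntrack_tcp_loose=0` for the running kernel.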

> Otherwise, state synchronization does not make sense because conntrack
> can pick connections from the middle, ie. you can implement "poor man"
> failover and let conntrack recover the history from the middle.

This seems to be what is happening (correct me if I am wrong).  I sometimes manage to fail over successfully after checking carefully that things are in order, namely that the state is present in the internal and external caches respectively.  Either way, the commit error remains.

> Can you see events on the active node with `conntrack -E`?

No, and I have to restart conntrackd to actually be able to see the states (I guess thanks to `StartupResync on`).

> Did you debug with tcpdump on both ends to check to see if conntrackd
> delivers the synchronization messages?

Yes, I noticed a UDP datagram that is larger than the others.

> What do conntrackd stats tell you? There is a good number of options
> that allow you debug your setup.

Right after a successful fail-over:

root@vrrp1:~# conntrackd -s
cache internal:
current active connections:                2
connections created:                       4    failed:            0
connections updated:                       0    failed:            0
connections destroyed:                     2    failed:            0

cache external:
current active connections:                2
connections created:                       4    failed:            0
connections updated:                       0    failed:            0
connections destroyed:                     2    failed:            0

traffic processed:
                    0 Bytes                         0 Pckts

UDP traffic (active device=eth2):
                 8384 Bytes sent                 8216 Bytes recv
                  499 Pckts sent                  498 Pckts recv
                    0 Error send                    0 Error recv

message tracking:
                    0 Malformed msgs                    0 Lost msgs

root@vrrp2:~# conntrackd -s
cache internal:
current active connections:                2
connections created:                       2    failed:            0
connections updated:                       0    failed:            0
connections destroyed:                     0    failed:            0

cache external:
current active connections:                2
connections created:                       6    failed:            0
connections updated:                       0    failed:            0
connections destroyed:                     4    failed:            0

traffic processed:
                    0 Bytes                         0 Pckts

UDP traffic (active device=eth2):
                 8552 Bytes sent                 8704 Bytes recv
                  519 Pckts sent                  519 Pckts recv
                    0 Error send                    0 Error recv

message tracking:
                    0 Malformed msgs                    0 Lost msgs

but the commit error is always there in the logs.
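
The counters above can be cross-checked mechanically; here is a minimal shell sketch (a hypothetical helper, not part of conntrack-tools) that extracts the created-minus-destroyed delta of the internal cache from `conntrackd -s` output:

```shell
# Hypothetical helper (not part of conntrack-tools): extract the
# created-minus-destroyed delta of the internal cache from the output
# of `conntrackd -s`, to cross-check "current active connections".
active_delta() {
    awk '/^cache internal:/ { in_int = 1 }
         /^cache external:/ { in_int = 0 }
         in_int && /connections created:/   { created   = $3 }
         in_int && /connections destroyed:/ { destroyed = $3 }
         END { print created - destroyed }'
}

# usage on a live node:
#   conntrackd -s | active_delta
```

In the vrrp1 output above, 4 created minus 2 destroyed matches the 2 currently active connections.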

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: failing fail-over - commit still in progress
  2023-08-24  9:59           ` Pierre-Philipp Braun
@ 2023-08-28  8:02             ` Pablo Neira Ayuso
       [not found]               ` <f1291caf-2103-3fcb-7e60-e5a3218624ad@nethence.com>
  0 siblings, 1 reply; 10+ messages in thread
From: Pablo Neira Ayuso @ 2023-08-28  8:02 UTC (permalink / raw)
  To: Pierre-Philipp Braun; +Cc: netfilter

On Thu, Aug 24, 2023 at 12:59:47PM +0300, Pierre-Philipp Braun wrote:
> On 8/21/23 12:26, Pablo Neira Ayuso wrote:
> > As I said before, you have to have a stateful ruleset which does not
> > pick up states from the middle.
> 
> I am now filtering both interfaces, the front-facing and the internal one, on the FORWARD chain.
> 
> table inet filter {
>         chain input {
>                 type filter hook input priority filter; policy accept;
>         }
> 
>         chain forward {
>                 type filter hook forward priority filter; policy drop;
>                 ip protocol icmp accept
>                 ct state invalid log prefix "INVALID: " drop
>                 ct state established,related,new accept
>                 log prefix "DROP POLICY: "
>         }
> 
>         chain output {
>                 type filter hook output priority filter; policy accept;
>         }
> }
> table ip nat {
>         chain postrouting {
>                 type nat hook postrouting priority srcnat; policy accept;
>                 ip saddr 10.1.0.0/16 oif "eth0" snat to 217.19.208.157
>         }
> 
>         chain prerouting {
>                 type nat hook prerouting priority dstnat; policy accept;
>                 iif "eth0" tcp dport 50 dnat to 10.1.0.50:22
>         }
> }
> 
> However, it looks like I am still tracking states in the middle.
> 
> > nftables is completely irrelevant in this picture. State
> > synchronization relies on ctnetlink and userspace conntrackd for state
> > synchronization. nftables is only the packet classification framework.
> 
> I was just wondering whether I absolutely had to use the iptables example from the testcase sample.
> I notice that example has an additional SYN-flag check.
> Basically I am doing things the other way around: DNAT instead of SNAT.
>
> > Rule of thumb: you have to disable nf_conntrack_tcp_loose in
> > conntrack and use a stateful ruleset which drops packets that are in
> > an invalid state.
> 
> tcp_loose is zero; as for the stateful ruleset, I am still not sure.
> 
> > Otherwise, state synchronization does not make sense because conntrack
> > can pick connections from the middle, ie. you can implement "poor man"
> > failover and let conntrack recover the history from the middle.
> 
> This seems to be what is happening (correct me if I am wrong).  I sometimes manage to fail over successfully after checking carefully that things are in order, namely that the state is present in the internal and external caches respectively.  Either way, the commit error remains.
> 
> > Can you see events on the active node with `conntrack -E`?
> 
> No, and I have to restart conntrackd to actually be able to see the states (I guess thanks to `StartupResync on`).

Did you enable CONFIG_NF_CONNTRACK_EVENTS in your kernel?

CONFIG_NF_CONNTRACK_EVENTS=y

`conntrack -E' should show events regardless of your conntrackd
configuration when you create new flows.
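
A quick way to verify that requirement is to grep the kernel config; a sketch (the config file locations are assumptions, depending on the distribution):

```shell
# Sketch (paths assumed): check that the running kernel was built with
# conntrack event support, which `conntrack -E` and conntrackd require.
check_conntrack_events() {
    cfg="$1"    # e.g. /boot/config-$(uname -r) or /proc/config.gz
    case "$cfg" in
        *.gz) zcat "$cfg" 2>/dev/null ;;
        *)    cat  "$cfg" 2>/dev/null ;;
    esac | grep -q '^CONFIG_NF_CONNTRACK_EVENTS=y'
}

if check_conntrack_events "/boot/config-$(uname -r)"; then
    echo "conntrack events: enabled"
else
    echo "conntrack events: missing (or config file not found)"
fi
```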

> > Did you debug with tcpdump on both ends to check to see if conntrackd
> > delivers the synchronization messages?
> 
> Yes, I noticed a UDP datagram that is larger than the others.

You should see UDP traffic on port 3780, unless you have changed the
Port setting in your conntrackd.conf configuration file.

> > What do conntrackd stats tell you? There is a good number of options
> > that allow you debug your setup.
> 
> Right after a successful fail-over:
> 
> root@vrrp1:~# conntrackd -s
> cache internal:
> current active connections:                2
> connections created:                       4    failed:            0
> connections updated:                       0    failed:            0
> connections destroyed:                     2    failed:            0
> 
> cache external:
> current active connections:                2
> connections created:                       4    failed:            0
> connections updated:                       0    failed:            0
> connections destroyed:                     2    failed:            0
> 
> traffic processed:
>                    0 Bytes                         0 Pckts
> 
> UDP traffic (active device=eth2):
>                 8384 Bytes sent                 8216 Bytes recv
>                  499 Pckts sent                  498 Pckts recv
>                    0 Error send                    0 Error recv
> 
> message tracking:
>                    0 Malformed msgs                    0 Lost msgs
> 
> root@vrrp2:~# conntrackd -s
> cache internal:
> current active connections:                2
> connections created:                       2    failed:            0
> connections updated:                       0    failed:            0
> connections destroyed:                     0    failed:            0
> 
> cache external:
> current active connections:                2
> connections created:                       6    failed:            0
> connections updated:                       0    failed:            0
> connections destroyed:                     4    failed:            0
> 
> traffic processed:
>                    0 Bytes                         0 Pckts
> 
> UDP traffic (active device=eth2):
>                 8552 Bytes sent                 8704 Bytes recv
>                  519 Pckts sent                  519 Pckts recv
>                    0 Error send                    0 Error recv
> 
> message tracking:
>                    0 Malformed msgs                    0 Lost msgs
> 
> but the commit error is always there in the logs.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: failing fail-over - commit still in progress
       [not found]               ` <f1291caf-2103-3fcb-7e60-e5a3218624ad@nethence.com>
@ 2023-09-01  8:37                 ` Pablo Neira Ayuso
  0 siblings, 0 replies; 10+ messages in thread
From: Pablo Neira Ayuso @ 2023-09-01  8:37 UTC (permalink / raw)
  To: Pierre-Philipp Braun; +Cc: netfilter

Hi,

[ Restoring Cc to netfilter@vger.kernel.org ]

On Fri, Sep 01, 2023 at 05:03:47AM +0300, Pierre-Philipp Braun wrote:
> > Did you enable CONFIG_NF_CONNTRACK_EVENTS in your kernel?
> > 
> > CONFIG_NF_CONNTRACK_EVENTS=y
> > 
> > `conntrack -E' should show events regardless your conntrackd
> > configuration when you create new flows.
> 
> I enabled NF_CONNTRACK_EVENTS and it works much better now.

This is described as a requirement in the documentation:

http://conntrack-tools.netfilter.org/manual.html

> The state shows up right away in the internal vs. external cache,
> and the fail-over works in both directions.  The MWE nftables sample
> I showed earlier finally seems "valid", as I am no longer catching
> states in the middle.  Can we now rule out the firewall issue?
> 
> However, during a fail-over, I always see this anyhow on a receiving node:
> 
> [Fri Sep  1 00:44:21 2023] (pid=1069) [notice] committing all external caches
> [Fri Sep  1 00:44:21 2023] (pid=1069) [notice] Committed 1 new entries
> [Fri Sep  1 00:44:21 2023] (pid=1069) [notice] commit has taken 0.000059 seconds
> [Fri Sep  1 00:44:21 2023] (pid=1069) [ERROR] ignoring flush command, commit still in progress
> [Fri Sep  1 00:44:21 2023] (pid=1069) [notice] resync with master conntrack table
> [Fri Sep  1 00:44:21 2023] (pid=1069) [notice] sending bulk update
> [Fri Sep  1 00:44:29 2023] (pid=1069) [notice] resync requested by other node
> [Fri Sep  1 00:44:29 2023] (pid=1069) [notice] sending bulk update
> 
> and after that, the state is sometimes seen in the internal + external
> caches on the receiving node only; otherwise it shows up everywhere:
> in the internal + external caches and in `conntrack -L` on both nodes.
> 
> In the worst (latter) case, things only settle down when I restart
> conntrackd on both nodes.

How are you integrating conntrackd with keepalived? Are you using
the doc/sync/primary-backup.sh script?

The error above means that the flush command was sent to conntrackd
while there was a pending commit in progress.

> > You should see UDP traffic on port 3780, unless you have changed the
> > Port setting in your conntrackd.conf configuration file.
> 
> Yes, I see the traffic in both directions.
> 
> I don't know what else could be wrong, so I dug into the state-tracking
> daemon's config a little more.

The filtering options you are exploring below have nothing to do with
the problem you are reporting above.

> About this additional stanza in conntrackd.conf: are "lazy backup
> replicas" the equivalent of catching states "in the middle"?  The sync
> error happens in either case, but I would like to make sure I can keep
> this disabled.
> 
>                 # Uncomment this line below if you want to filter by flow state.
>                 # This option introduces a trade-off in the replication: it
>                 # reduces CPU consumption at the cost of having lazy backup
>                 # firewall replicas. The existing TCP states are: SYN_SENT,
>                 # SYN_RECV, ESTABLISHED, FIN_WAIT, CLOSE_WAIT, LAST_ACK,
>                 # TIME_WAIT, CLOSED, LISTEN.
>                 #
>                 # State Accept {
>                 #       ESTABLISHED CLOSED TIME_WAIT CLOSE_WAIT for TCP
>                 # }

These are filtering options to reduce the number of synchronization
messages; see the documentation.

> About the Address ignore list, I tried following the FTFW sample by
> ignoring local IPs and VIPs.  I also tried the other way around for
> testing, tracking --only-- the target systems behind the DNAT:
> 
>                 Address Accept {
>                         IPv4_address 10.1.0.0/16
>                 }

This is again a filtering option to reduce synchronization messages.

> ==> same result, exact same symptoms (the commit error, and states showing up everywhere)
> 
> 
> What could be causing the commit error...

Please provide more information on how you integrate keepalived with
conntrackd. Revisit the documentation to make sure things are done as
described there.

> ... and the states to show up on both nodes even with conntrack -L?

Note that the doc/sync/primary-backup.sh script does not flush the
kernel table after failover; instead, it shortens the timeout of the
old entries on the backup node so that they expire sooner:

        $CONNTRACKD_BIN -C $CONNTRACKD_CONFIG -t
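
For reference, the integration that script expects from keepalived looks roughly like this (a sketch; the sync-group and instance names are hypothetical):

```
# keepalived.conf fragment (sketch): hand VRRP state transitions
# over to conntrackd via the conntrack-tools helper script
vrrp_sync_group G1 {
    group {
        VI_1
    }
    notify_master "/etc/conntrackd/primary-backup.sh primary"
    notify_backup "/etc/conntrackd/primary-backup.sh backup"
    notify_fault  "/etc/conntrackd/primary-backup.sh fault"
}
```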

> This is how the PoC 2 setup looks like now:
> 
> linux 6.1.49.domU (defconfig+xen+few more things built-in, no modules)
> and debian 12.1 packages
>   conntrack-tools v1.4.7
>   libnfnetlink 1.0.2
>   keepalived 2.2.7
> 
> Thanks

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-09-01  8:37 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-11  8:55 failing fail-over - commit still in progress Pierre-Philipp Braun
2023-08-11  8:58 ` Pierre-Philipp Braun
2023-08-11 10:53 ` Pablo Neira Ayuso
2023-08-12  9:52   ` Pierre-Philipp Braun
2023-08-12 21:08     ` Pablo Neira Ayuso
2023-08-21  6:19       ` Pierre-Philipp Braun
2023-08-21  9:26         ` Pablo Neira Ayuso
2023-08-24  9:59           ` Pierre-Philipp Braun
2023-08-28  8:02             ` Pablo Neira Ayuso
     [not found]               ` <f1291caf-2103-3fcb-7e60-e5a3218624ad@nethence.com>
2023-09-01  8:37                 ` Pablo Neira Ayuso
