All of lore.kernel.org
 help / color / mirror / Atom feed
* Most optimal method to dump UDP conntrack entries
@ 2024-10-17 10:26 Antonio Ojea
  2024-10-17 12:46 ` Florian Westphal
  0 siblings, 1 reply; 23+ messages in thread
From: Antonio Ojea @ 2024-10-17 10:26 UTC (permalink / raw)
  To: netfilter

Hi,

In the context of Kubernetes, when DNATing entries for UDP Services,
we need to deal with some edge cases where some UDP entries are left
orphaned but blackhole the traffic to the new endpoints.

At high level, the scenario is:
- Client IP_A sends UDP traffic to VirtualIP IP_B --> Kubernetes
Translates this to Endpoint IP_C
- Endpoint IP_C is replaced by Endpoint IP_D, but since Client IP_A
does not stop sending traffic, the conntrack entry IP_A IP_B --> IP_C
takes precedence and is being renewed, so traffic is not sent to the
new Endpoint IP_D and is lost.

To solve this problem, we have some heuristics to detect those
scenarios when the endpoints change and flush the conntrack entries,
however, since this is event based, if we lost the event that
triggered the problem or something happens that fails to clean up the
entry,  the user need to manually flush the entries.

We are implementing a new approach to solve this, we list all the UDP
conntrack entries using netlink, compare against the existing
programmed nftables/iptables rules, and flush the ones we know are
stale.

During the implementation review, the question [1] this raises is, how
impactful is it to dump all the conntrack entries each time we program
the iptables/nftables rules (this can be every 1s on nodes with a lot
of entries)?
Is this approach completely safe?
Should we try to read from procfs instead?
Any other suggestions?

Thanks,
A.Ojea


[1]: https://github.com/kubernetes/kubernetes/pull/127318#discussion_r1756967553

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-10-17 10:26 Most optimal method to dump UDP conntrack entries Antonio Ojea
@ 2024-10-17 12:46 ` Florian Westphal
  2024-10-17 16:36   ` Pablo Neira Ayuso
  0 siblings, 1 reply; 23+ messages in thread
From: Florian Westphal @ 2024-10-17 12:46 UTC (permalink / raw)
  To: Antonio Ojea; +Cc: netfilter

Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
> In the context of Kubernetes, when DNATing entries for UDP Services,
> we need to deal with some edge cases where some UDP entries are left
> orphaned but blackhole the traffic to the new endpoints.
> 
> At high level, the scenario is:
> - Client IP_A sends UDP traffic to VirtualIP IP_B --> Kubernetes
> Translates this to Endpoint IP_C
> - Endpoint IP_C is replaced by Endpoint IP_D, but since Client IP_A
> does not stop sending traffic, the conntrack entry IP_A IP_B --> IP_C
> takes precedence and is being renewed, so traffic is not sent to the
> new Endpoint IP_D and is lost.
> 
> To solve this problem, we have some heuristics to detect those
> scenarios when the endpoints change and flush the conntrack entries,
> however, since this is event based, if we lost the event that
> triggered the problem or something happens that fails to clean up the
> entry,  the user need to manually flush the entries.
> 
> We are implementing a new approach to solve this, we list all the UDP
> conntrack entries using netlink, compare against the existing
> programmed nftables/iptables rules, and flush the ones we know are
> stale.
> 
> During the implementation review, the question [1] this raises is, how
> impactful is it to dump all the conntrack entries each time we program
> the iptables/nftables rules (this can be every 1s on nodes with a lot
> of entries)?
> Is this approach completely safe?
> Should we try to read from procfs instead?

Walking all conntrack entries in 1s intervals is going to be slow, no
matter the chosen interface.  Even doing the filtering in the kernel to
not dump all entries but only those that match udp/port/ip criteria is
not going to change it.

Also both proc and netlink dumps can miss entries (albeit its rare),
if parallel insertions/deletes happen (which is normal on busy system).

I wonder why the appropriate delete requests cannot be done when the
mapping is altered, I mean, you must have some code that issues
either iptables -t nat -D ... or nft delete element ... or similar.

If you do that, why not also fire off the conntrack -D request
afterwards?  Or are these publish/withdraw so frequent that this
doesn't matter compared to poll based approach?

Something like
   conntrack -D --protonum 17 --orig-dst $vserver --orig-port-dst 53 --reply-src $rserver --reply-port-src 5353

would zap everything to $rserver mapped to $vserver from client point of view.

Granted, this isn't great either, but you would not have to poll
all the time?  Or are updates

Is this only a problem for UDP?  I wonder if we should change UDP
conntrack to no longer refresh timestamp for original direction if
connection is subject to NAT, that would make them expire in the given
'dnat mapping went away and client tries to talk to unreachable host'
scenario.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-10-17 12:46 ` Florian Westphal
@ 2024-10-17 16:36   ` Pablo Neira Ayuso
  2024-10-17 22:10     ` Antonio Ojea
  0 siblings, 1 reply; 23+ messages in thread
From: Pablo Neira Ayuso @ 2024-10-17 16:36 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Antonio Ojea, netfilter

On Thu, Oct 17, 2024 at 02:46:32PM +0200, Florian Westphal wrote:
> Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
> > In the context of Kubernetes, when DNATing entries for UDP Services,
> > we need to deal with some edge cases where some UDP entries are left
> > orphaned but blackhole the traffic to the new endpoints.
> > 
> > At high level, the scenario is:
> > - Client IP_A sends UDP traffic to VirtualIP IP_B --> Kubernetes
> > Translates this to Endpoint IP_C
> > - Endpoint IP_C is replaced by Endpoint IP_D, but since Client IP_A
> > does not stop sending traffic, the conntrack entry IP_A IP_B --> IP_C
> > takes precedence and is being renewed, so traffic is not sent to the
> > new Endpoint IP_D and is lost.
> > 
> > To solve this problem, we have some heuristics to detect those
> > scenarios when the endpoints change and flush the conntrack entries,
> > however, since this is event based, if we lost the event that
> > triggered the problem or something happens that fails to clean up the
> > entry,  the user need to manually flush the entries.
> > 
> > We are implementing a new approach to solve this, we list all the UDP
> > conntrack entries using netlink, compare against the existing
> > programmed nftables/iptables rules, and flush the ones we know are
> > stale.
> > 
> > During the implementation review, the question [1] this raises is, how
> > impactful is it to dump all the conntrack entries each time we program
> > the iptables/nftables rules (this can be every 1s on nodes with a lot
> > of entries)?
> > Is this approach completely safe?
> > Should we try to read from procfs instead?
> 
> Walking all conntrack entries in 1s intervals is going to be slow, no
> matter the chosen interface.  Even doing the filtering in the kernel to
> not dump all entries but only those that match udp/port/ip criteria is
> not going to change it.
> 
> Also both proc and netlink dumps can miss entries (albeit its rare),
> if parallel insertions/deletes happen (which is normal on busy system).
> 
> I wonder why the appropriate delete requests cannot be done when the
> mapping is altered, I mean, you must have some code that issues
> either iptables -t nat -D ... or nft delete element ... or similar.
> 
> If you do that, why not also fire off the conntrack -D request
> afterwards?  Or are these publish/withdraw so frequent that this
> doesn't matter compared to poll based approach?
> 
> Something like
>    conntrack -D --protonum 17 --orig-dst $vserver --orig-port-dst 53 --reply-src $rserver --reply-port-src 5353
> 
> would zap everything to $rserver mapped to $vserver from client point of view.

This reminds me, it would be good to expand conntrack utility to use
the new kernel API to filter from kernel + delete.

I will try to get here.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-10-17 16:36   ` Pablo Neira Ayuso
@ 2024-10-17 22:10     ` Antonio Ojea
  2024-10-17 23:30       ` Florian Westphal
  2024-11-11 12:06       ` Pablo Neira Ayuso
  0 siblings, 2 replies; 23+ messages in thread
From: Antonio Ojea @ 2024-10-17 22:10 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter

On Thu, 17 Oct 2024 at 17:36, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>
> On Thu, Oct 17, 2024 at 02:46:32PM +0200, Florian Westphal wrote:
> > Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
> > > In the context of Kubernetes, when DNATing entries for UDP Services,
> > > we need to deal with some edge cases where some UDP entries are left
> > > orphaned but blackhole the traffic to the new endpoints.
> > >
> > > At high level, the scenario is:
> > > - Client IP_A sends UDP traffic to VirtualIP IP_B --> Kubernetes
> > > Translates this to Endpoint IP_C
> > > - Endpoint IP_C is replaced by Endpoint IP_D, but since Client IP_A
> > > does not stop sending traffic, the conntrack entry IP_A IP_B --> IP_C
> > > takes precedence and is being renewed, so traffic is not sent to the
> > > new Endpoint IP_D and is lost.
> > >
> > > To solve this problem, we have some heuristics to detect those
> > > scenarios when the endpoints change and flush the conntrack entries,
> > > however, since this is event based, if we lost the event that
> > > triggered the problem or something happens that fails to clean up the
> > > entry,  the user need to manually flush the entries.
> > >
> > > We are implementing a new approach to solve this, we list all the UDP
> > > conntrack entries using netlink, compare against the existing
> > > programmed nftables/iptables rules, and flush the ones we know are
> > > stale.
> > >
> > > During the implementation review, the question [1] this raises is, how
> > > impactful is it to dump all the conntrack entries each time we program
> > > the iptables/nftables rules (this can be every 1s on nodes with a lot
> > > of entries)?
> > > Is this approach completely safe?
> > > Should we try to read from procfs instead?
> >
> > Walking all conntrack entries in 1s intervals is going to be slow, no
> > matter the chosen interface.  Even doing the filtering in the kernel to
> > not dump all entries but only those that match udp/port/ip criteria is
> > not going to change it.

We are not worried about being slow in the order of seconds, the
system is eventually consistent so there can always be a reasonable
latency.
Since we only care about UDP, losing packets during that period is not
desirable but is assumable.
My main concern is if constantly dumping all the entries via netlink
can cause any issue or increase resources consumption.

> >
> > Also both proc and netlink dumps can miss entries (albeit its rare),
> > if parallel insertions/deletes happen (which is normal on busy system).
> >

That is one of the reasons we want to implement this reconcile loop,
so it can be resilient to this kind of errors, we keep the state on
the API in the control plane, so we can always rebuild the state in
the dataplane (recreating nftables rules and delete conntrack entries
that does not match the current state)

> > I wonder why the appropriate delete requests cannot be done when the
> > mapping is altered, I mean, you must have some code that issues
> > either iptables -t nat -D ... or nft delete element ... or similar.
> >
> > If you do that, why not also fire off the conntrack -D request
> > afterwards?  Or are these publish/withdraw so frequent that this
> > doesn't matter compared to poll based approach?
> >
> > Something like
> >    conntrack -D --protonum 17 --orig-dst $vserver --orig-port-dst 53 --reply-src $rserver --reply-port-src 5353
> >
> > would zap everything to $rserver mapped to $vserver from client point of view.
>

This is how it is implemented today and it works, but it does not
handle process restarts per example, or is not resilient to errors.
The implementation is also much more complex because we need to
implement all the possible edge cases that can leave stale entries


> This reminds me, it would be good to expand conntrack utility to use
> the new kernel API to filter from kernel + delete.
>
> I will try to get here.

I should bring more of these problems to the mailing list :-)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-10-17 22:10     ` Antonio Ojea
@ 2024-10-17 23:30       ` Florian Westphal
  2024-10-18 11:05         ` Antonio Ojea
  2024-11-11 12:06       ` Pablo Neira Ayuso
  1 sibling, 1 reply; 23+ messages in thread
From: Florian Westphal @ 2024-10-17 23:30 UTC (permalink / raw)
  To: Antonio Ojea; +Cc: Pablo Neira Ayuso, Florian Westphal, netfilter

Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
> We are not worried about being slow in the order of seconds, the
> system is eventually consistent so there can always be a reasonable
> latency.
> Since we only care about UDP, losing packets during that period is not
> desirable but is assumable.
> My main concern is if constantly dumping all the entries via netlink
> can cause any issue or increase resources consumption.

It doesn't need lots of extra memory on kernel side, just for the
some internal tracking ("where were we") and temporary netlink
buffer to assemble the dump chunks returned to userspace.

> > > Also both proc and netlink dumps can miss entries (albeit its rare),
> > > if parallel insertions/deletes happen (which is normal on busy system).
> 
> That is one of the reasons we want to implement this reconcile loop,
> so it can be resilient to this kind of errors, we keep the state on
> the API in the control plane, so we can always rebuild the state in
> the dataplane (recreating nftables rules and delete conntrack entries
> that does not match the current state)

OK, but wouldn't it be better to not need this at all?

The UDP tracker is old and refreshs the timeout for every packet, but
who says it needs to stay that way?

With very simple patch to only refresh in 'reply' dir:

    [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
[DESTROY] udp      17 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
    [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
[DESTROY] udp      17 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
    [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819

This is with a unidirectional UDP stream.

diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c
--- a/net/netfilter/nf_conntrack_proto_udp.c
+++ b/net/netfilter/nf_conntrack_proto_udp.c
@@ -17,6 +17,7 @@
 #include <linux/netfilter.h>
 #include <linux/netfilter_ipv4.h>
 #include <linux/netfilter_ipv6.h>
+#include <net/netfilter/nf_conntrack_acct.h>
 #include <net/netfilter/nf_conntrack_l4proto.h>
 #include <net/netfilter/nf_conntrack_ecache.h>
 #include <net/netfilter/nf_conntrack_timeout.h>
@@ -80,6 +81,35 @@ static bool udp_error(struct sk_buff *skb,
 	return false;
 }
 
+static void nf_ct_refresh_udp(struct nf_conn *ct,
+			      enum ip_conntrack_dir dir,
+			      u32 len, u32 extra_jiffies)
+{
+	/* Only update if this is not a fixed timeout */
+	if (test_bit(IPS_FIXED_TIMEOUT_BIT, &ct->status))
+		return nf_ct_acct_update(ct, dir, len);
+
+	switch (dir) {
+	case IP_CT_DIR_ORIGINAL:
+		/* do not refresh in original dir, else stale dnat mapping
+		 * can be kept alive indefinitely.
+		 */
+		if (nf_ct_is_confirmed(ct))
+			return nf_ct_acct_update(ct, dir, len);
+		break;
+	case IP_CT_DIR_REPLY:
+		extra_jiffies += nfct_time_stamp;
+		break;
+	case IP_CT_DIR_MAX: /* silence warning wrt. unhandled enum */
+		break;
+	}
+
+	if (READ_ONCE(ct->timeout) != extra_jiffies)
+		WRITE_ONCE(ct->timeout, extra_jiffies);
+
+	nf_ct_acct_update(ct, dir, len);
+}
+
 /* Returns verdict for packet, and may modify conntracktype */
 int nf_conntrack_udp_packet(struct nf_conn *ct,
 			    struct sk_buff *skb,
@@ -114,7 +144,7 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
 			stream = (status & IPS_ASSURED) == 0;
 		}
 
-		nf_ct_refresh_acct(ct, ctinfo, skb, extra);
+		nf_ct_refresh_udp(ct, CTINFO2DIR(ctinfo), skb->len, extra);
 
 		/* never set ASSURED for IPS_NAT_CLASH, they time out soon */
 		if (unlikely((status & IPS_NAT_CLASH)))
@@ -124,7 +154,7 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
 		if (stream && !test_and_set_bit(IPS_ASSURED_BIT, &ct->status))
 			nf_conntrack_event_cache(IPCT_ASSURED, ct);
 	} else {
-		nf_ct_refresh_acct(ct, ctinfo, skb, timeouts[UDP_CT_UNREPLIED]);
+		nf_ct_refresh_udp(ct, CTINFO2DIR(ctinfo), skb->len, timeouts[UDP_CT_UNREPLIED]);
 	}
 	return NF_ACCEPT;
 }

The obvious counter-argument is "but what if its the reply side that is
sending, and we have a stale SNAT mapping?".

Solution would be to track timestamp of last packet seen per direction,
and then stop refreshing the timeout if one side ceases sending.

I think thats much better than what we have now and there would be no
need to actively zap UDP flows, they would timeout automatically once
one side ceases to transmit.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-10-17 23:30       ` Florian Westphal
@ 2024-10-18 11:05         ` Antonio Ojea
  2024-10-18 11:33           ` Florian Westphal
  0 siblings, 1 reply; 23+ messages in thread
From: Antonio Ojea @ 2024-10-18 11:05 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Pablo Neira Ayuso, netfilter

On Fri, 18 Oct 2024 at 00:30, Florian Westphal <fw@strlen.de> wrote:
>
> Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
> > We are not worried about being slow in the order of seconds, the
> > system is eventually consistent so there can always be a reasonable
> > latency.
> > Since we only care about UDP, losing packets during that period is not
> > desirable but is assumable.
> > My main concern is if constantly dumping all the entries via netlink
> > can cause any issue or increase resources consumption.
>
> It doesn't need lots of extra memory on kernel side, just for the
> some internal tracking ("where were we") and temporary netlink
> buffer to assemble the dump chunks returned to userspace.
>
> > > > Also both proc and netlink dumps can miss entries (albeit its rare),
> > > > if parallel insertions/deletes happen (which is normal on busy system).
> >
> > That is one of the reasons we want to implement this reconcile loop,
> > so it can be resilient to this kind of errors, we keep the state on
> > the API in the control plane, so we can always rebuild the state in
> > the dataplane (recreating nftables rules and delete conntrack entries
> > that does not match the current state)
>
> OK, but wouldn't it be better to not need this at all?
>

Absolutely, that will be awesome

> The UDP tracker is old and refreshs the timeout for every packet, but
> who says it needs to stay that way?
>
> With very simple patch to only refresh in 'reply' dir:
>
>     [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> [DESTROY] udp      17 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
>     [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> [DESTROY] udp      17 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
>     [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
>
> This is with a unidirectional UDP stream.
>
> diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c
> --- a/net/netfilter/nf_conntrack_proto_udp.c
> +++ b/net/netfilter/nf_conntrack_proto_udp.c
> @@ -17,6 +17,7 @@
>  #include <linux/netfilter.h>
>  #include <linux/netfilter_ipv4.h>
>  #include <linux/netfilter_ipv6.h>
> +#include <net/netfilter/nf_conntrack_acct.h>
>  #include <net/netfilter/nf_conntrack_l4proto.h>
>  #include <net/netfilter/nf_conntrack_ecache.h>
>  #include <net/netfilter/nf_conntrack_timeout.h>
> @@ -80,6 +81,35 @@ static bool udp_error(struct sk_buff *skb,
>         return false;
>  }
>
> +static void nf_ct_refresh_udp(struct nf_conn *ct,
> +                             enum ip_conntrack_dir dir,
> +                             u32 len, u32 extra_jiffies)
> +{
> +       /* Only update if this is not a fixed timeout */
> +       if (test_bit(IPS_FIXED_TIMEOUT_BIT, &ct->status))
> +               return nf_ct_acct_update(ct, dir, len);
> +
> +       switch (dir) {
> +       case IP_CT_DIR_ORIGINAL:
> +               /* do not refresh in original dir, else stale dnat mapping
> +                * can be kept alive indefinitely.
> +                */
> +               if (nf_ct_is_confirmed(ct))
> +                       return nf_ct_acct_update(ct, dir, len);
> +               break;
> +       case IP_CT_DIR_REPLY:
> +               extra_jiffies += nfct_time_stamp;
> +               break;
> +       case IP_CT_DIR_MAX: /* silence warning wrt. unhandled enum */
> +               break;
> +       }
> +
> +       if (READ_ONCE(ct->timeout) != extra_jiffies)
> +               WRITE_ONCE(ct->timeout, extra_jiffies);
> +
> +       nf_ct_acct_update(ct, dir, len);
> +}
> +
>  /* Returns verdict for packet, and may modify conntracktype */
>  int nf_conntrack_udp_packet(struct nf_conn *ct,
>                             struct sk_buff *skb,
> @@ -114,7 +144,7 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
>                         stream = (status & IPS_ASSURED) == 0;
>                 }
>
> -               nf_ct_refresh_acct(ct, ctinfo, skb, extra);
> +               nf_ct_refresh_udp(ct, CTINFO2DIR(ctinfo), skb->len, extra);
>
>                 /* never set ASSURED for IPS_NAT_CLASH, they time out soon */
>                 if (unlikely((status & IPS_NAT_CLASH)))
> @@ -124,7 +154,7 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
>                 if (stream && !test_and_set_bit(IPS_ASSURED_BIT, &ct->status))
>                         nf_conntrack_event_cache(IPCT_ASSURED, ct);
>         } else {
> -               nf_ct_refresh_acct(ct, ctinfo, skb, timeouts[UDP_CT_UNREPLIED]);
> +               nf_ct_refresh_udp(ct, CTINFO2DIR(ctinfo), skb->len, timeouts[UDP_CT_UNREPLIED]);
>         }
>         return NF_ACCEPT;
>  }
>
> The obvious counter-argument is "but what if its the reply side that is
> sending, and we have a stale SNAT mapping?".
>
> Solution would be to track timestamp of last packet seen per direction,
> and then stop refreshing the timeout if one side ceases sending.
>
> I think thats much better than what we have now and there would be no
> need to actively zap UDP flows, they would timeout automatically once
> one side ceases to transmit.

If I understand correctly this will have some visible effects, before
this change the same tuple will only hit the same destination because
the conntrack entry will prevail, and after this patch the packet will
be spread out randomly to the number of backends.
I'm aware of some requests to have this behavior in kubernetes for UDP
service, but I'm scared that there can be some people relying on this
behavior ... and all the possible edge cases this will open, what
happens if we have two UDP packets in one direction before the first
reply is seen?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-10-18 11:05         ` Antonio Ojea
@ 2024-10-18 11:33           ` Florian Westphal
  2024-10-18 14:10             ` Antonio Ojea
  0 siblings, 1 reply; 23+ messages in thread
From: Florian Westphal @ 2024-10-18 11:33 UTC (permalink / raw)
  To: Antonio Ojea; +Cc: Florian Westphal, Pablo Neira Ayuso, netfilter

Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
> > The UDP tracker is old and refreshs the timeout for every packet, but
> > who says it needs to stay that way?
> >
> > With very simple patch to only refresh in 'reply' dir:
> >
> >     [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > [DESTROY] udp      17 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> >     [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > [DESTROY] udp      17 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> >     [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> >
> > This is with a unidirectional UDP stream.
> >
> > diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c
> > --- a/net/netfilter/nf_conntrack_proto_udp.c
> > +++ b/net/netfilter/nf_conntrack_proto_udp.c
> > @@ -17,6 +17,7 @@
> >  #include <linux/netfilter.h>
> >  #include <linux/netfilter_ipv4.h>
> >  #include <linux/netfilter_ipv6.h>
> > +#include <net/netfilter/nf_conntrack_acct.h>
> >  #include <net/netfilter/nf_conntrack_l4proto.h>
> >  #include <net/netfilter/nf_conntrack_ecache.h>
> >  #include <net/netfilter/nf_conntrack_timeout.h>
> > @@ -80,6 +81,35 @@ static bool udp_error(struct sk_buff *skb,
> >         return false;
> >  }
> >
> > +static void nf_ct_refresh_udp(struct nf_conn *ct,
> > +                             enum ip_conntrack_dir dir,
> > +                             u32 len, u32 extra_jiffies)
> > +{
> > +       /* Only update if this is not a fixed timeout */
> > +       if (test_bit(IPS_FIXED_TIMEOUT_BIT, &ct->status))
> > +               return nf_ct_acct_update(ct, dir, len);
> > +
> > +       switch (dir) {
> > +       case IP_CT_DIR_ORIGINAL:
> > +               /* do not refresh in original dir, else stale dnat mapping
> > +                * can be kept alive indefinitely.
> > +                */
> > +               if (nf_ct_is_confirmed(ct))
> > +                       return nf_ct_acct_update(ct, dir, len);
> > +               break;
> > +       case IP_CT_DIR_REPLY:
> > +               extra_jiffies += nfct_time_stamp;
> > +               break;
> > +       case IP_CT_DIR_MAX: /* silence warning wrt. unhandled enum */
> > +               break;
> > +       }
> > +
> > +       if (READ_ONCE(ct->timeout) != extra_jiffies)
> > +               WRITE_ONCE(ct->timeout, extra_jiffies);
> > +
> > +       nf_ct_acct_update(ct, dir, len);
> > +}
> > +
> >  /* Returns verdict for packet, and may modify conntracktype */
> >  int nf_conntrack_udp_packet(struct nf_conn *ct,
> >                             struct sk_buff *skb,
> > @@ -114,7 +144,7 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
> >                         stream = (status & IPS_ASSURED) == 0;
> >                 }
> >
> > -               nf_ct_refresh_acct(ct, ctinfo, skb, extra);
> > +               nf_ct_refresh_udp(ct, CTINFO2DIR(ctinfo), skb->len, extra);
> >
> >                 /* never set ASSURED for IPS_NAT_CLASH, they time out soon */
> >                 if (unlikely((status & IPS_NAT_CLASH)))
> > @@ -124,7 +154,7 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
> >                 if (stream && !test_and_set_bit(IPS_ASSURED_BIT, &ct->status))
> >                         nf_conntrack_event_cache(IPCT_ASSURED, ct);
> >         } else {
> > -               nf_ct_refresh_acct(ct, ctinfo, skb, timeouts[UDP_CT_UNREPLIED]);
> > +               nf_ct_refresh_udp(ct, CTINFO2DIR(ctinfo), skb->len, timeouts[UDP_CT_UNREPLIED]);
> >         }
> >         return NF_ACCEPT;
> >  }
> >
> > The obvious counter-argument is "but what if its the reply side that is
> > sending, and we have a stale SNAT mapping?".
> >
> > Solution would be to track timestamp of last packet seen per direction,
> > and then stop refreshing the timeout if one side ceases sending.
> >
> > I think thats much better than what we have now and there would be no
> > need to actively zap UDP flows, they would timeout automatically once
> > one side ceases to transmit.
> 
> If I understand correctly this will have some visible effects, before
> this change the same tuple will only hit the same destination because
> the conntrack entry will prevail, and after this patch the packet will
> be spread out randomly to the number of backends.

Nothing changes unless flow is 100% uni-directional (no
return traffic OR only reply traffic after initial packet).

Old behavior:

Packet	 Direction
     1   Origin			# create entry with 30s timeout, save current time
     2   Reply			# flag entry assured (cannot be fast-removed
				# if conntrack table is full), refresh timeout to 30s
     3  Origin			# refresh timeout to 30s OR 2 Minutes, if current time is 2s in the past
     4  Origin			# refresh timeout to 30s OR 2 Minutes, if current time is 2s in the past
     ...
     5  Origin			# refresh timeout to 30s OR 2 Minutes, if current time is 2s in the past


Thus, once DNAT mapping changes, conntrack entry remains alive for as
long as origin sends more packets.

After change:

Packet	 Direction
     1   Origin			# create entry with 30s timeout, save current time
     2   Reply			# flag entry assured (cannot be fast-removed
				# if conntrack table is full), refresh timeout to 30s
     3  Origin			# do not refresh timeout
     4  Origin			# do not refresh timeout
     ...
     5  Origin			# do not refresh timeout

Once 30s have elapsed, this entry will expire, even if packets are seen.
Next packet will create a new conntrack, and a new NAT mapping could be
chosen.


If there is reply traffic, nothing changes:

Packet	 Direction
     1   Origin			# create entry with 30s timeout, save current time
     2   Reply			# flag entry assured (cannot be fast-removed
				# if conntrack table is full), refresh timeout to 30s
     3  Origin			# do not refresh timeout
     4  Reply			# refresh timeout to 30s OR 2 Minutes, if current time is 2s in the past
     ...
     5  Origin			# do not refresh timeout

Thats the general idea anyway.
I'll try to make this more robust and also let it timeout if we have
only reply traffic without any origin packets.

> I'm aware of some requests to have this behavior in kubernetes for UDP
> service, but I'm scared that there can be some people relying on this
> behavior ... and all the possible edge cases this will open, what
> happens if we have two UDP packets in one direction before the first
> reply is seen?

Same as what happens now, 2nd packet follows NAT mapping of first one.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-10-18 11:33           ` Florian Westphal
@ 2024-10-18 14:10             ` Antonio Ojea
  2024-10-21 13:53               ` Florian Westphal
  0 siblings, 1 reply; 23+ messages in thread
From: Antonio Ojea @ 2024-10-18 14:10 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Pablo Neira Ayuso, netfilter

On Fri, 18 Oct 2024 at 13:33, Florian Westphal <fw@strlen.de> wrote:
>
> Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
> > > The UDP tracker is old and refreshs the timeout for every packet, but
> > > who says it needs to stay that way?
> > >
> > > With very simple patch to only refresh in 'reply' dir:
> > >
> > >     [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > > [DESTROY] udp      17 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > >     [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > > [DESTROY] udp      17 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > >     [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > >
> > > This is with a unidirectional UDP stream.
> > >
> > > diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c
> > > --- a/net/netfilter/nf_conntrack_proto_udp.c
> > > +++ b/net/netfilter/nf_conntrack_proto_udp.c
> > > @@ -17,6 +17,7 @@
> > >  #include <linux/netfilter.h>
> > >  #include <linux/netfilter_ipv4.h>
> > >  #include <linux/netfilter_ipv6.h>
> > > +#include <net/netfilter/nf_conntrack_acct.h>
> > >  #include <net/netfilter/nf_conntrack_l4proto.h>
> > >  #include <net/netfilter/nf_conntrack_ecache.h>
> > >  #include <net/netfilter/nf_conntrack_timeout.h>
> > > @@ -80,6 +81,35 @@ static bool udp_error(struct sk_buff *skb,
> > >         return false;
> > >  }
> > >
> > > +static void nf_ct_refresh_udp(struct nf_conn *ct,
> > > +                             enum ip_conntrack_dir dir,
> > > +                             u32 len, u32 extra_jiffies)
> > > +{
> > > +       /* Only update if this is not a fixed timeout */
> > > +       if (test_bit(IPS_FIXED_TIMEOUT_BIT, &ct->status))
> > > +               return nf_ct_acct_update(ct, dir, len);
> > > +
> > > +       switch (dir) {
> > > +       case IP_CT_DIR_ORIGINAL:
> > > +               /* do not refresh in original dir, else stale dnat mapping
> > > +                * can be kept alive indefinitely.
> > > +                */
> > > +               if (nf_ct_is_confirmed(ct))
> > > +                       return nf_ct_acct_update(ct, dir, len);
> > > +               break;
> > > +       case IP_CT_DIR_REPLY:
> > > +               extra_jiffies += nfct_time_stamp;
> > > +               break;
> > > +       case IP_CT_DIR_MAX: /* silence warning wrt. unhandled enum */
> > > +               break;
> > > +       }
> > > +
> > > +       if (READ_ONCE(ct->timeout) != extra_jiffies)
> > > +               WRITE_ONCE(ct->timeout, extra_jiffies);
> > > +
> > > +       nf_ct_acct_update(ct, dir, len);
> > > +}
> > > +
> > >  /* Returns verdict for packet, and may modify conntracktype */
> > >  int nf_conntrack_udp_packet(struct nf_conn *ct,
> > >                             struct sk_buff *skb,
> > > @@ -114,7 +144,7 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
> > >                         stream = (status & IPS_ASSURED) == 0;
> > >                 }
> > >
> > > -               nf_ct_refresh_acct(ct, ctinfo, skb, extra);
> > > +               nf_ct_refresh_udp(ct, CTINFO2DIR(ctinfo), skb->len, extra);
> > >
> > >                 /* never set ASSURED for IPS_NAT_CLASH, they time out soon */
> > >                 if (unlikely((status & IPS_NAT_CLASH)))
> > > @@ -124,7 +154,7 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
> > >                 if (stream && !test_and_set_bit(IPS_ASSURED_BIT, &ct->status))
> > >                         nf_conntrack_event_cache(IPCT_ASSURED, ct);
> > >         } else {
> > > -               nf_ct_refresh_acct(ct, ctinfo, skb, timeouts[UDP_CT_UNREPLIED]);
> > > +               nf_ct_refresh_udp(ct, CTINFO2DIR(ctinfo), skb->len, timeouts[UDP_CT_UNREPLIED]);
> > >         }
> > >         return NF_ACCEPT;
> > >  }
> > >
> > > The obvious counter-argument is "but what if its the reply side that is
> > > sending, and we have a stale SNAT mapping?".
> > >
> > > Solution would be to track timestamp of last packet seen per direction,
> > > and then stop refreshing the timeout if one side ceases sending.
> > >
> > > I think thats much better than what we have now and there would be no
> > > need to actively zap UDP flows, they would timeout automatically once
> > > one side ceases to transmit.
> >
> > If I understand correctly this will have some visible effects, before
> > this change the same tuple will only hit the same destination because
> > the conntrack entry will prevail, and after this patch the packet will
> > be spread out randomly to the number of backends.
>
> Nothing changes unless flow is 100% uni-directional (no
> return traffic OR only reply traffic after initial packet).
>
> Old behavior:
>
> Packet   Direction
>      1   Origin                 # create entry with 30s timeout, save current time
>      2   Reply                  # flag entry assured (cannot be fast-removed
>                                 # if conntrack table is full), refresh timeout to 30s
>      3  Origin                  # refresh timeout to 30s OR 2 Minutes, if current time is 2s in the past
>      4  Origin                  # refresh timeout to 30s OR 2 Minutes, if current time is 2s in the past
>      ...
>      5  Origin                  # refresh timeout to 30s OR 2 Minutes, if current time is 2s in the past
>
>
> Thus, once DNAT mapping changes, conntrack entry remains alive for as
> long as origin sends more packets.
>
> After change:
>
> Packet   Direction
>      1   Origin                 # create entry with 30s timeout, save current time
>      2   Reply                  # flag entry assured (cannot be fast-removed
>                                 # if conntrack table is full), refresh timeout to 30s
>      3  Origin                  # do not refresh timeout
>      4  Origin                  # do not refresh timeout
>      ...
>      5  Origin                  # do not refresh timeout
>
> Once 30s have elapsed, this entry will expire, even if packets are seen.
> Next packet will create a new conntrack, and a new NAT mapping could be
> chosen.
>
>
> If there is reply traffic, nothing changes:
>
> Packet   Direction
>      1   Origin                 # create entry with 30s timeout, save current time
>      2   Reply                  # flag entry assured (cannot be fast-removed
>                                 # if conntrack table is full), refresh timeout to 30s
>      3  Origin                  # do not refresh timeout
>      4  Reply                   # refresh timeout to 30s OR 2 Minutes, if current time is 2s in the past
>      ...
>      5  Origin                  # do not refresh timeout
>
> Thats the general idea anyway.
> I'll try to make this more robust and also let it timeout if we have
> only reply traffic without any origin packets.
>
> > I'm aware of some requests to have this behavior in kubernetes for UDP
> > service, but I'm scared that there can be some people relying on this
> > behavior ... and all the possible edge cases this will open, what
> > happens if we have two UDP packets in one direction before the first
> > reply is seen?
>
> Same as what happens now, 2nd packet follows NAT mapping of first one.

This looks like the way to go ... if you can send me a patch I can do
some testing next week and report back

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-10-18 14:10             ` Antonio Ojea
@ 2024-10-21 13:53               ` Florian Westphal
  2024-10-23  9:03                 ` Benny Lyne Amorsen
  2024-11-10 21:50                 ` Florian Westphal
  0 siblings, 2 replies; 23+ messages in thread
From: Florian Westphal @ 2024-10-21 13:53 UTC (permalink / raw)
  To: Antonio Ojea; +Cc: Florian Westphal, Pablo Neira Ayuso, netfilter

Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
> On Fri, 18 Oct 2024 at 13:33, Florian Westphal <fw@strlen.de> wrote:
> > Same as what happens now, 2nd packet follows NAT mapping of first one.
> 
> This looks like the way to go ... if you can send me a patch I can do
> some testing next week and report back

Here is a better patch, renew only when responses are seen.
This means that once either initiator or responder ceases to send
packets entry will time out.

Subject: netfilter: nf_conntrack_proto_udp: renew timeout only for bidirectional traffic

In commit e15d4cdf27cb
("netfilter: conntrack: do not renew entry stuck in tcp SYN_SENT state")
I changed TCP conntrack to let tcp entries time out in case initial SYN
packets never see SYN/ACK reply.  This allows NAT rules to be
re-evaluated again.

Do something similar for UDP and renew the timeout if the current
packet is the opposite direction as the last one.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_conntrack.h   |  3 ++-
 net/netfilter/nf_conntrack_proto_udp.c | 33 +++++++++++++++++---------
 2 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index cba3ccf03fcc..e0c098e2705d 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -25,7 +25,8 @@
 #include <net/netfilter/nf_conntrack_tuple.h>
 
 struct nf_ct_udp {
-	unsigned long	stream_ts;
+	unsigned long stream_ts;
+	enum ip_conntrack_dir last_dir;
 };
 
 /* per conntrack: protocol private data */
diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c
index 0030fbe8885c..33fcf52eeaf4 100644
--- a/net/netfilter/nf_conntrack_proto_udp.c
+++ b/net/netfilter/nf_conntrack_proto_udp.c
@@ -80,6 +80,16 @@ static bool udp_error(struct sk_buff *skb,
 	return false;
 }
 
+static bool udp_ts_reply(struct nf_conn *ct, enum ip_conntrack_dir dir)
+{
+	bool is_reply = READ_ONCE(ct->proto.udp.last_dir) != dir;
+
+	if (is_reply)
+		WRITE_ONCE(ct->proto.udp.last_dir, dir);
+
+	return is_reply;
+}
+
 /* Returns verdict for packet, and may modify conntracktype */
 int nf_conntrack_udp_packet(struct nf_conn *ct,
 			    struct sk_buff *skb,
@@ -87,8 +97,9 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
 			    enum ip_conntrack_info ctinfo,
 			    const struct nf_hook_state *state)
 {
+	unsigned long status, extra;
+	enum ip_conntrack_dir dir;
 	unsigned int *timeouts;
-	unsigned long status;
 
 	if (udp_error(skb, dataoff, state))
 		return -NF_ACCEPT;
@@ -97,15 +108,19 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
 	if (!timeouts)
 		timeouts = udp_get_timeouts(nf_ct_net(ct));
 
+	extra = timeouts[UDP_CT_UNREPLIED];
 	status = READ_ONCE(ct->status);
-	if ((status & IPS_CONFIRMED) == 0)
+	if ((status & IPS_CONFIRMED) == 0) {
 		ct->proto.udp.stream_ts = 2 * HZ + jiffies;
+		ct->proto.udp.last_dir = IP_CT_DIR_MAX;
+	}
+
+	dir = CTINFO2DIR(ctinfo);
 
 	/* If we've seen traffic both ways, this is some kind of UDP
 	 * stream. Set Assured.
 	 */
 	if (status & IPS_SEEN_REPLY) {
-		unsigned long extra = timeouts[UDP_CT_UNREPLIED];
 		bool stream = false;
 
 		/* Still active after two seconds? Extend timeout. */
@@ -114,18 +129,14 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
 			stream = (status & IPS_ASSURED) == 0;
 		}
 
-		nf_ct_refresh_acct(ct, ctinfo, skb, extra);
-
-		/* never set ASSURED for IPS_NAT_CLASH, they time out soon */
-		if (unlikely((status & IPS_NAT_CLASH)))
-			return NF_ACCEPT;
-
 		/* Also, more likely to be important, and not a probe */
 		if (stream && !test_and_set_bit(IPS_ASSURED_BIT, &ct->status))
 			nf_conntrack_event_cache(IPCT_ASSURED, ct);
-	} else {
-		nf_ct_refresh_acct(ct, ctinfo, skb, timeouts[UDP_CT_UNREPLIED]);
 	}
+
+	if (udp_ts_reply(ct, dir))
+		nf_ct_refresh_acct(ct, ctinfo, skb, extra);
+
 	return NF_ACCEPT;
 }
 
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-10-21 13:53               ` Florian Westphal
@ 2024-10-23  9:03                 ` Benny Lyne Amorsen
  2024-11-10 21:50                 ` Florian Westphal
  1 sibling, 0 replies; 23+ messages in thread
From: Benny Lyne Amorsen @ 2024-10-23  9:03 UTC (permalink / raw)
  To: netfilter

Florian Westphal <fw@strlen.de> writes:

> Here is a better patch, renew only when responses are seen.
> This means that once either initiator or responder ceases to send
> packets entry will time out.

It is common to syslog using UDP without having a response. It seems like this
will allocate a new NAT entry every so often, changing the source port
of any SNATted syslog. This in turn will mean extra sessions on any
other firewalls the traffic might go through.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-10-21 13:53               ` Florian Westphal
  2024-10-23  9:03                 ` Benny Lyne Amorsen
@ 2024-11-10 21:50                 ` Florian Westphal
  2024-11-11  6:33                   ` Antonio Ojea
  1 sibling, 1 reply; 23+ messages in thread
From: Florian Westphal @ 2024-11-10 21:50 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Antonio Ojea, Pablo Neira Ayuso, netfilter

Florian Westphal <fw@strlen.de> wrote:
> Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
> > On Fri, 18 Oct 2024 at 13:33, Florian Westphal <fw@strlen.de> wrote:
> > > Same as what happens now, 2nd packet follows NAT mapping of first one.
> > 
> > This looks like the way to go ... if you can send me a patch I can do
> > some testing next week and report back
> 
> Here is a better patch, renew only when responses are seen.
> This means that once either initiator or responder ceases to send
> packets entry will time out.
> 
> Subject: netfilter: nf_conntrack_proto_udp: renew timeout only for bidirectional traffic

Ping.  Did you have a chance to test this?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-11-10 21:50                 ` Florian Westphal
@ 2024-11-11  6:33                   ` Antonio Ojea
  0 siblings, 0 replies; 23+ messages in thread
From: Antonio Ojea @ 2024-11-11  6:33 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Pablo Neira Ayuso, netfilter

On Sun, 10 Nov 2024 at 14:50, Florian Westphal <fw@strlen.de> wrote:
>
> Florian Westphal <fw@strlen.de> wrote:
> > Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
> > > On Fri, 18 Oct 2024 at 13:33, Florian Westphal <fw@strlen.de> wrote:
> > > > Same as what happens now, 2nd packet follows NAT mapping of first one.
> > >
> > > This looks like the way to go ... if you can send me a patch I can do
> > > some testing next week and report back
> >
> > Here is a better patch, renew only when responses are seen.
> > This means that once either initiator or responder ceases to send
> > packets entry will time out.
> >
> > Subject: netfilter: nf_conntrack_proto_udp: renew timeout only for bidirectional traffic
>
> Ping.  Did you have a chance to test this?

Sorry, I'm out all this week on Kubecon and previous weeks there was a
lot of work with the kubernetes release.

I will try to do it the week of 25t of Nov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-10-17 22:10     ` Antonio Ojea
  2024-10-17 23:30       ` Florian Westphal
@ 2024-11-11 12:06       ` Pablo Neira Ayuso
  2024-11-11 12:09         ` Florian Westphal
  1 sibling, 1 reply; 23+ messages in thread
From: Pablo Neira Ayuso @ 2024-11-11 12:06 UTC (permalink / raw)
  To: Antonio Ojea; +Cc: Florian Westphal, netfilter

Hi Antonio,

On Thu, Oct 17, 2024 at 11:10:02PM +0100, Antonio Ojea wrote:
> On Thu, 17 Oct 2024 at 17:36, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> >
> > On Thu, Oct 17, 2024 at 02:46:32PM +0200, Florian Westphal wrote:
> > > Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
> > > > In the context of Kubernetes, when DNATing entries for UDP Services,
> > > > we need to deal with some edge cases where some UDP entries are left
> > > > orphaned but blackhole the traffic to the new endpoints.
> > > >
> > > > At high level, the scenario is:
> > > > - Client IP_A sends UDP traffic to VirtualIP IP_B --> Kubernetes
> > > > Translates this to Endpoint IP_C
> > > > - Endpoint IP_C is replaced by Endpoint IP_D, but since Client IP_A
> > > > does not stop sending traffic, the conntrack entry IP_A IP_B --> IP_C
> > > > takes precedence and is being renewed, so traffic is not sent to the
> > > > new Endpoint IP_D and is lost.
> > > >
> > > > To solve this problem, we have some heuristics to detect those
> > > > scenarios when the endpoints change and flush the conntrack entries,
> > > > however, since this is event based, if we lost the event that
> > > > triggered the problem or something happens that fails to clean up the
> > > > entry,  the user need to manually flush the entries.

You can still stick to the event approach, then resort to
resync/reconcile loop when userspace gets a report that events are
getting lost, ie. hybrid approach.

> > > > We are implementing a new approach to solve this, we list all the UDP
> > > > conntrack entries using netlink, compare against the existing
> > > > programmed nftables/iptables rules, and flush the ones we know are
> > > > stale.
> > > >
> > > > During the implementation review, the question [1] this raises is, how
> > > > impactful is it to dump all the conntrack entries each time we program
> > > > the iptables/nftables rules (this can be every 1s on nodes with a lot
> > > > of entries)?
> > > > Is this approach completely safe?
> > > > Should we try to read from procfs instead?
> > >
> > > Walking all conntrack entries in 1s intervals is going to be slow, no
> > > matter the chosen interface.  Even doing the filtering in the kernel to
> > > not dump all entries but only those that match udp/port/ip criteria is
> > > not going to change it.
> 
> We are not worried about being slow in the order of seconds, the
> system is eventually consistent so there can always be a reasonable
> latency.
> Since we only care about UDP, losing packets during that period is not
> desirable but is assumable.
> My main concern is if constantly dumping all the entries via netlink
> can cause any issue or increase resources consumption.
>
> > >
> > > Also both proc and netlink dumps can miss entries (albeit its rare),
> > > if parallel insertions/deletes happen (which is normal on busy system).
> > >
> 
> That is one of the reasons we want to implement this reconcile loop,
> so it can be resilient to this kind of errors, we keep the state on
> the API in the control plane, so we can always rebuild the state in
> the dataplane (recreating nftables rules and delete conntrack entries
> that does not match the current state)
> 
> > > I wonder why the appropriate delete requests cannot be done when the
> > > mapping is altered, I mean, you must have some code that issues
> > > either iptables -t nat -D ... or nft delete element ... or similar.
> > >
> > > If you do that, why not also fire off the conntrack -D request
> > > afterwards?  Or are these publish/withdraw so frequent that this
> > > doesn't matter compared to poll based approach?
> > >
> > > Something like
> > >    conntrack -D --protonum 17 --orig-dst $vserver --orig-port-dst 53 --reply-src $rserver --reply-port-src 5353
> > >
> > > would zap everything to $rserver mapped to $vserver from client point of view.
> >
> 
> This is how it is implemented today and it works, but it does not
> handle process restarts per example, or is not resilient to errors.
> The implementation is also much more complex because we need to
> implement all the possible edge cases that can leave stale entries

It should also be possible to shrink timeouts on restart via conntrack -U
which would be similar to the approach that Florian is proposing, but from
control plane rather than updating existing UDP timeout policy.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-11-11 12:06       ` Pablo Neira Ayuso
@ 2024-11-11 12:09         ` Florian Westphal
  2024-11-11 12:29           ` Pablo Neira Ayuso
  0 siblings, 1 reply; 23+ messages in thread
From: Florian Westphal @ 2024-11-11 12:09 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Antonio Ojea, Florian Westphal, netfilter

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > This is how it is implemented today and it works, but it does not
> > handle process restarts per example, or is not resilient to errors.
> > The implementation is also much more complex because we need to
> > implement all the possible edge cases that can leave stale entries
> 
> It should also be possible to shrink timeouts on restart via conntrack -U
> which would be similar to the approach that Florian is proposing, but from
> control plane rather than updating existing UDP timeout policy.

The time and effort needed to make something as basic as NAT
work properly is jus silly.

Lets fix conntrack so this "just works".

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-11-11 12:09         ` Florian Westphal
@ 2024-11-11 12:29           ` Pablo Neira Ayuso
  2024-11-11 12:54             ` Florian Westphal
  0 siblings, 1 reply; 23+ messages in thread
From: Pablo Neira Ayuso @ 2024-11-11 12:29 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Antonio Ojea, netfilter

On Mon, Nov 11, 2024 at 01:09:46PM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > This is how it is implemented today and it works, but it does not
> > > handle process restarts per example, or is not resilient to errors.
> > > The implementation is also much more complex because we need to
> > > implement all the possible edge cases that can leave stale entries
> > 
> > It should also be possible to shrink timeouts on restart via conntrack -U
> > which would be similar to the approach that Florian is proposing, but from
> > control plane rather than updating existing UDP timeout policy.
> 
> The time and effort needed to make something as basic as NAT
> work properly is jus silly.
> 
> Lets fix conntrack so this "just works".

Ok, then...

+static bool udp_ts_reply(struct nf_conn *ct, enum ip_conntrack_dir dir)
+{
+       bool is_reply = READ_ONCE(ct->proto.udp.last_dir) != dir;
+
+       if (is_reply)
+               WRITE_ONCE(ct->proto.udp.last_dir, dir);
+
+       return is_reply;
+}

... if packet in the other direction is seen, then...

+       if (udp_ts_reply(ct, dir))
+               nf_ct_refresh_acct(ct, ctinfo, skb, extra);

... conntrack entry is refreshed?

Will this work for, let's say, RTP traffic which goes over UDP and it
is unidirectional? Well, maybe you could occasionally see a RCTP
packet as reply to get statistics, but those could just not be
available.

I am not sure we can make assumptions on the direction like this, any
application protocol could run over UDP.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-11-11 12:29           ` Pablo Neira Ayuso
@ 2024-11-11 12:54             ` Florian Westphal
  2024-11-12  9:16               ` Pablo Neira Ayuso
  0 siblings, 1 reply; 23+ messages in thread
From: Florian Westphal @ 2024-11-11 12:54 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, Antonio Ojea, netfilter

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Mon, Nov 11, 2024 at 01:09:46PM +0100, Florian Westphal wrote:
> > The time and effort needed to make something as basic as NAT
> > work properly is jus silly.
> > 
> > Lets fix conntrack so this "just works".
> 
> Ok, then...
> 
> +static bool udp_ts_reply(struct nf_conn *ct, enum ip_conntrack_dir dir)
> +{
> +       bool is_reply = READ_ONCE(ct->proto.udp.last_dir) != dir;
> +
> +       if (is_reply)
> +               WRITE_ONCE(ct->proto.udp.last_dir, dir);
> +
> +       return is_reply;
> +}
> 
> ... if packet in the other direction is seen, then...
> 
> +       if (udp_ts_reply(ct, dir))
> +               nf_ct_refresh_acct(ct, ctinfo, skb, extra);
> 
> ... conntrack entry is refreshed?

Yes.

> Will this work for, let's say, RTP traffic which goes over UDP and it
> is unidirectional? Well, maybe you could occasionally see a RCTP
> packet as reply to get statistics, but those could just not be
> available.

We could add a || ct->master to the is_reply test.

> I am not sure we can make assumptions on the direction like this, any
> application protocol could run over UDP.

What about adding a CT template option to control the behaviour?

More work, but would avoid any compat concerns.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-11-11 12:54             ` Florian Westphal
@ 2024-11-12  9:16               ` Pablo Neira Ayuso
  2024-11-12  9:20                 ` Pablo Neira Ayuso
  0 siblings, 1 reply; 23+ messages in thread
From: Pablo Neira Ayuso @ 2024-11-12  9:16 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Antonio Ojea, netfilter

On Mon, Nov 11, 2024 at 01:54:56PM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > On Mon, Nov 11, 2024 at 01:09:46PM +0100, Florian Westphal wrote:
> > > The time and effort needed to make something as basic as NAT
> > > work properly is jus silly.
> > > 
> > > Lets fix conntrack so this "just works".
> > 
> > Ok, then...
> > 
> > +static bool udp_ts_reply(struct nf_conn *ct, enum ip_conntrack_dir dir)
> > +{
> > +       bool is_reply = READ_ONCE(ct->proto.udp.last_dir) != dir;
> > +
> > +       if (is_reply)
> > +               WRITE_ONCE(ct->proto.udp.last_dir, dir);
> > +
> > +       return is_reply;
> > +}
> > 
> > ... if packet in the other direction is seen, then...
> > 
> > +       if (udp_ts_reply(ct, dir))
> > +               nf_ct_refresh_acct(ct, ctinfo, skb, extra);
> > 
> > ... conntrack entry is refreshed?
> 
> Yes.
> 
> > Will this work for, let's say, RTP traffic which goes over UDP and it
> > is unidirectional? Well, maybe you could occasionally see a RCTP
> > packet as reply to get statistics, but those could just not be
> > available.
> 
> We could add a || ct->master to the is_reply test.

Assuming SIP helper is in place, then yes.

> > I am not sure we can make assumptions on the direction like this, any
> > application protocol could run over UDP.
> 
> What about adding a CT template option to control the behaviour?

Maybe custom ct timeout policy can help instead? Or even extend the
timeout policy to support for the behaviour you want to put in place
(refresh timer only when packets are seen in both directions).

If user knows what application protocol runs over UDP in just port,
then they can define finer grain timeout policies accordingly.

> More work, but would avoid any compat concerns.

Agreed.

The UDP conntracker is already making assumptions by handling UDP
traffic as "stateful" based on default timeouts that were define back
in 1999 and that has been adjusted several times in the past.

Shrinking too much the timeouts could also lead to releasing NAT too
early, hence removing the NAT mapping too soon, it is hard to know the
implications of this wrt. to the application protocol.

Maybe Antonio can extend the requirements stub he provides, he
mentions the following scenarios:

- Conntrack entry removal for backends that are gone. Probably
  speeding up conntrack -D with the new CTA_FILTER support is
  sufficient to improve the situation. IIRC, a user reported from
  3 seconds to 0.5 to delete million of entries via CTA_FILTER.

- Service restart, ie. "reconcile" scenario, I think this is harder
  because IIUC this means userspace needs to compare the current
  configuration with the conntrack entries in the table and purge
  those that are stale?

I guess the concern is that assured flows cannot be expelled from the
conntrack table via early_drop, that is why an expedite cleanup is
important?

Thanks.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-11-12  9:16               ` Pablo Neira Ayuso
@ 2024-11-12  9:20                 ` Pablo Neira Ayuso
  2024-11-12 14:41                   ` Antonio Ojea
  0 siblings, 1 reply; 23+ messages in thread
From: Pablo Neira Ayuso @ 2024-11-12  9:20 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Antonio Ojea, netfilter

On Tue, Nov 12, 2024 at 10:16:45AM +0100, Pablo Neira Ayuso wrote:
> I guess the concern is that assured flows cannot be expelled from the
> conntrack table via early_drop, that is why an expedite cleanup is
> important?

Actually, the issue is that packets could end up in a backend which
does not exist after re-configuration, therefore, removing the entry
need to happen so ongoing flow have a chance to talk to another
(different) backend.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-11-12  9:20                 ` Pablo Neira Ayuso
@ 2024-11-12 14:41                   ` Antonio Ojea
  2024-11-12 14:43                     ` Antonio Ojea
  2024-11-12 16:18                     ` Florian Westphal
  0 siblings, 2 replies; 23+ messages in thread
From: Antonio Ojea @ 2024-11-12 14:41 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter

[-- Attachment #1: Type: text/plain, Size: 888 bytes --]

On Tue, 12 Nov 2024 at 02:20, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>
> On Tue, Nov 12, 2024 at 10:16:45AM +0100, Pablo Neira Ayuso wrote:
> > I guess the concern is that assured flows cannot be expelled from the
> > conntrack table via early_drop, that is why an expedite cleanup is
> > important?
>
> Actually, the issue is that packets could end up in a backend which
> does not exist after re-configuration, therefore, removing the entry
> need to happen so ongoing flow have a chance to talk to another
> (different) backend.

Please take a look to this kselftest attached that emulates the
problematic behavior in kubernetes,

I think that in UDP the nat rule should take precedence over the
conntrack entry,on the contrary to TCP where it is important to
preserve the session if it has been established.

I  did only a quick test and seems to fail also with Florian patch

[-- Attachment #2: kselftest_conntrack_udp.sh --]
[-- Type: text/x-sh, Size: 24325 bytes --]

diff --git a/tools/testing/selftests/net/netfilter/Makefile b/tools/testing/selftests/net/netfilter/Makefile
index ffe161fac8b5..9ba5610f3ebc 100644
--- a/tools/testing/selftests/net/netfilter/Makefile
+++ b/tools/testing/selftests/net/netfilter/Makefile
@@ -12,6 +12,7 @@ TEST_PROGS += conntrack_dump_flush.sh
 TEST_PROGS += conntrack_icmp_related.sh
 TEST_PROGS += conntrack_ipip_mtu.sh
 TEST_PROGS += conntrack_tcp_unreplied.sh
+TEST_PROGS += conntrack_udp_expires.sh
 TEST_PROGS += conntrack_sctp_collision.sh
 TEST_PROGS += conntrack_vrf.sh
 TEST_PROGS += conntrack_reverse_clash.sh
diff --git a/tools/testing/selftests/net/netfilter/conntrack_udp_expires.sh b/tools/testing/selftests/net/netfilter/conntrack_udp_expires.sh
new file mode 100755
index 000000000000..928f221ae739
--- /dev/null
+++ b/tools/testing/selftests/net/netfilter/conntrack_udp_expires.sh
@@ -0,0 +1,249 @@
+#!/bin/bash
+#
+# This tests conntrack on the following scenario:
+#
+#                         +------------+
+# +-------+               |  nsrouter  |                  +-------+
+# |ns1    |.99          .1|            |.1             .99|    ns2|
+# |   eth0|---------------|veth0  veth1|------------------|eth0   |
+# |       |  10.0.1.0/24  |            |   10.0.2.0/24    |       |
+# +-------+  dead:1::/64  |    veth2   |   dead:2::/64    +-------+
+#                         +------------+
+#                                |.1
+#                                |
+#                                |
+#                                |                        +-------+
+#                                |                     .99|    ns3|
+#                                +------------------------|eth0   |
+#                                       10.0.3.0/24       |       |
+#                                       dead:3::/64       +-------+
+#
+# nsrouters implement loadbalancing using DNAT with a virtual IP
+# 10.0.4.10 - dead:4::a
+# shellcheck disable=SC2162,SC2317
+
+source lib.sh
+ret=0
+# UDP is slow
+timeout=15
+
+cleanup()
+{
+	ip netns pids "$ns1" | xargs kill 2>/dev/null
+	ip netns pids "$ns2" | xargs kill 2>/dev/null
+	ip netns pids "$ns3" | xargs kill 2>/dev/null
+	ip netns pids "$nsrouter" | xargs kill 2>/dev/null
+
+	cleanup_all_ns
+}
+
+checktool "nft --version" "test without nft tool"
+checktool "socat -h" "run test without socat"
+
+trap cleanup EXIT
+setup_ns ns1 ns2 ns3 nsrouter
+
+if ! ip link add veth0 netns "$nsrouter" type veth peer name eth0 netns "$ns1" > /dev/null 2>&1; then
+    echo "SKIP: No virtual ethernet pair device support in kernel"
+    exit $ksft_skip
+fi
+ip link add veth1 netns "$nsrouter" type veth peer name eth0 netns "$ns2"
+ip link add veth2 netns "$nsrouter" type veth peer name eth0 netns "$ns3"
+
+ip -net "$nsrouter" link set veth0 up
+ip -net "$nsrouter" addr add 10.0.1.1/24 dev veth0
+ip -net "$nsrouter" addr add dead:1::1/64 dev veth0 nodad
+
+ip -net "$nsrouter" link set veth1 up
+ip -net "$nsrouter" addr add 10.0.2.1/24 dev veth1
+ip -net "$nsrouter" addr add dead:2::1/64 dev veth1 nodad
+
+ip -net "$nsrouter" link set veth2 up
+ip -net "$nsrouter" addr add 10.0.3.1/24 dev veth2
+ip -net "$nsrouter" addr add dead:3::1/64 dev veth2 nodad
+
+ip -net "$ns1" link set eth0 up
+ip -net "$ns2" link set eth0 up
+ip -net "$ns3" link set eth0 up
+
+ip -net "$ns1" addr add 10.0.1.99/24 dev eth0
+ip -net "$ns1" addr add dead:1::99/64 dev eth0 nodad
+ip -net "$ns1" route add default via 10.0.1.1
+ip -net "$ns1" route add default via dead:1::1
+
+ip -net "$ns2" addr add 10.0.2.99/24 dev eth0
+ip -net "$ns2" addr add dead:2::99/64 dev eth0 nodad
+ip -net "$ns2" route add default via 10.0.2.1
+ip -net "$ns2" route add default via dead:2::1
+
+ip -net "$ns3" addr add 10.0.3.99/24 dev eth0
+ip -net "$ns3" addr add dead:3::99/64 dev eth0 nodad
+ip -net "$ns3" route add default via 10.0.3.1
+ip -net "$ns3" route add default via dead:3::1
+
+ip netns exec "$nsrouter" sysctl net.ipv6.conf.all.forwarding=1 > /dev/null
+ip netns exec "$nsrouter" sysctl net.ipv4.conf.veth0.forwarding=1 > /dev/null
+ip netns exec "$nsrouter" sysctl net.ipv4.conf.veth1.forwarding=1 > /dev/null
+ip netns exec "$nsrouter" sysctl net.ipv4.conf.veth2.forwarding=1 > /dev/null
+
+test_ping() {
+  if ! ip netns exec "$ns1" ping -c 1 -q 10.0.2.99 > /dev/null; then
+	return 1
+  fi
+
+  if ! ip netns exec "$ns1" ping -c 1 -q dead:2::99 > /dev/null; then
+	return 2
+  fi
+
+  if ! ip netns exec "$ns1" ping -c 1 -q 10.0.3.99 > /dev/null; then
+	return 1
+  fi
+
+  if ! ip netns exec "$ns1" ping -c 1 -q dead:3::99 > /dev/null; then
+	return 2
+  fi
+
+  return 0
+}
+
+test_ping_router() {
+  if ! ip netns exec "$ns1" ping -c 1 -q 10.0.2.1 > /dev/null; then
+	return 3
+  fi
+
+  if ! ip netns exec "$ns1" ping -c 1 -q dead:2::1 > /dev/null; then
+	return 4
+  fi
+
+  return 0
+}
+
+
+listener_ready()
+{
+	local ns="$1"
+	local port="$2"
+	local proto="$3"
+	ss -N "$ns" -ln "$proto" -o "sport = :$port" | grep -q "$port"
+}
+
+test_conntrack_udp_expires()
+{
+	local ip_proto="$1"
+	# derived variables
+	local testname="test_${ip_proto}_udp_forward"
+	local socat_ipproto
+	local vip
+	local ns2_ip
+	local ns3_ip
+	local ns2_ip_port
+	local ns3_ip_port
+
+	# socat 1.8.0 has a bug that requires to specify the IP family to bind (fixed in 1.8.0.1)
+	case $ip_proto in
+	"ip")
+		socat_ipproto="-4"
+		vip=10.0.4.10
+		ns2_ip=10.0.2.99
+		ns3_ip=10.0.3.99
+		vip_ip_port="$vip:8080"
+		ns2_ip_port="$ns2_ip:8080"
+		ns3_ip_port="$ns3_ip:8080"
+	;;
+	"ip6")
+		socat_ipproto="-6"
+		vip=dead:4::a
+		ns2_ip=dead:2::99
+		ns3_ip=dead:3::99
+		vip_ip_port="[$vip]:8080"
+		ns2_ip_port="[$ns2_ip]:8080"
+		ns3_ip_port="[$ns3_ip]:8080"
+	;;
+	*)
+	echo "FAIL: unsupported protocol"
+	exit 255
+	;;
+	esac
+
+	ip netns exec "$nsrouter" nft -f /dev/stdin <<EOF
+flush ruleset
+table inet nat {
+	chain divert {
+		type nat hook prerouting priority 0; policy accept;
+		$ip_proto daddr $vip udp dport 8080 dnat to $ns2_ip_port
+	}
+}
+EOF
+
+	timeout "$timeout" ip netns exec "$ns2" socat "$socat_ipproto" udp-listen:8080,fork SYSTEM:"echo PONG_NS2" 2>/dev/null &
+	local server2_pid=$!
+
+	timeout "$timeout" ip netns exec "$ns3" socat "$socat_ipproto" udp-listen:8080,fork SYSTEM:"echo PONG_NS3" 2>/dev/null &
+	local server3_pid=$!
+
+	busywait "$BUSYWAIT_TIMEOUT" listener_ready "$ns2" 8080 "-u"
+	busywait "$BUSYWAIT_TIMEOUT" listener_ready "$ns3" 8080 "-u"
+
+	local result
+	# request from ns1 to ns2 (direct traffic)
+	result=$(echo PING | ip netns exec "$ns1" socat -t 2 -T 2 STDIO udp:"$ns2_ip_port",sourceport=18888)
+	if [ "$result" == "PONG_NS2" ] ;then
+		echo "PASS: conntrack udp test $testname: ns1 got reply \"$result\" connecting to ns2"
+	else
+		echo "ERROR: conntrack udp test $testname: ns1 got reply \"$result\" connecting to ns2, not \"PONG_NS2\" as intended"
+		ret=1
+	fi
+
+	# request from ns1 to ns3 (direct traffic)
+	result=$(echo PING | ip netns exec "$ns1" socat -t 2 -T 2 STDIO udp:"$ns3_ip_port",sourceport=18888)
+	if [ "$result" = "PONG_NS3" ] ;then
+		echo "PASS: conntrack udp test $testname: ns1 got reply \"$result\" connecting to ns3"
+	else
+		echo "ERROR: conntrack udp test $testname: ns1 got reply \"$result\" connecting to ns3, not \"PONG_NS3\" as intended"
+		ret=1
+	fi
+
+	# request from ns1 to vip (DNAT to ns2)
+	result=$(echo PING | ip netns exec "$ns1" socat -t 2 -T 2 STDIO udp:"$vip_ip_port",sourceport=18888)
+	if [ "$result" = "PONG_NS2" ] ;then
+		echo "PASS: conntrack udp test $testname: ns1 got reply \"$result\" connecting to vip (ns2)"
+	else
+		echo "ERROR: conntrack udp test $testname: ns1 got reply \"$result\" connecting to vip, not \"PONG_NS2\" as intended"
+		ret=1
+	fi
+
+	# replace the DNAT rule to direct and replace ns2 destination with ns3
+		ip netns exec "$nsrouter" nft -f /dev/stdin <<EOF
+flush ruleset
+table inet nat {
+	chain divert {
+		type nat hook prerouting priority 0; policy accept;
+		$ip_proto daddr $vip udp dport 8080 dnat to $ns3_ip_port
+	}
+}
+EOF
+	# request from ns1 to vip (DNAT to ns3)
+	# reuse the same port to validate the existing conntrack entry does not
+	# shadow the actual nftables rule.
+	result=$(echo PING | ip netns exec "$ns1" socat -t 2 -T 2 STDIO udp:"$vip_ip_port",sourceport=18888)
+	if [ "$result" = "PONG_NS3" ] ;then
+		echo "PASS: conntrack udp test $testname: ns1 got reply \"$result\" connecting to vip (ns3)"
+	else
+		echo "ERROR: conntrack udp test $testname: ns1 got reply \"$result\" connecting to vip, not \"PONG_NS3\" as intended"
+		ret=1
+	fi
+}
+
+
+if test_ping; then
+	# queue bypass works (rules were skipped, no listener)
+	echo "PASS: ${ns1} can reach ${ns2}"
+else
+	echo "FAIL: ${ns1} cannot reach ${ns2}: $ret" 1>&2
+	exit $ret
+fi
+
+test_conntrack_udp_expires "ip"
+test_conntrack_udp_expires "ip6"
+
+exit $ret
diff --git a/tools/testing/selftests/net/netfilter/conntrack_vrf.sh b/tools/testing/selftests/net/netfilter/conntrack_vrf.sh
index e95ecb37c2b1..77413cc11124 100755
--- a/tools/testing/selftests/net/netfilter/conntrack_vrf.sh
+++ b/tools/testing/selftests/net/netfilter/conntrack_vrf.sh
@@ -1,253 +1,249 @@
 #!/bin/bash
-
-# This script demonstrates interaction of conntrack and vrf.
-# The vrf driver calls the netfilter hooks again, with oif/iif
-# pointing at the VRF device.
-#
-# For ingress, this means first iteration has iifname of lower/real
-# device.  In this script, thats veth0.
-# Second iteration is iifname set to vrf device, tvrf in this script.
-#
-# For egress, this is reversed: first iteration has the vrf device,
-# second iteration is done with the lower/real/veth0 device.
-#
-# test_ct_zone_in demonstrates unexpected change of nftables
-# behavior # caused by commit 09e856d54bda5f28 "vrf: Reset skb conntrack
-# connection on VRF rcv"
 #
-# It was possible to assign conntrack zone to a packet (or mark it for
-# `notracking`) in the prerouting chain before conntrack, based on real iif.
+# This tests conntrack on the following scenario:
 #
-# After the change, the zone assignment is lost and the zone is assigned based
-# on the VRF master interface (in case such a rule exists).
-# assignment is lost. Instead, assignment based on the `iif` matching
-# Thus it is impossible to distinguish packets based on the original
-# interface.
+#                         +------------+
+# +-------+               |  nsrouter  |                  +-------+
+# |ns1    |.99          .1|            |.1             .99|    ns2|
+# |   eth0|---------------|veth0  veth1|------------------|eth0   |
+# |       |  10.0.1.0/24  |            |   10.0.2.0/24    |       |
+# +-------+  dead:1::/64  |    veth2   |   dead:2::/64    +-------+
+#                         +------------+
+#                                |.1
+#                                |
+#                                |
+#                                |                        +-------+
+#                                |                     .99|    ns3|
+#                                +------------------------|eth0   |
+#                                       10.0.3.0/24       |       |
+#                                       dead:3::/64       +-------+
 #
-# test_masquerade_vrf and test_masquerade_veth0 demonstrate the problem
-# that was supposed to be fixed by the commit mentioned above to make sure
-# that any fix to test case 1 won't break masquerade again.
+# nsrouters implement loadbalancing using DNAT with a virtual IP
+# 10.0.4.10 - dead:4::a
+# shellcheck disable=SC2162,SC2317
 
 source lib.sh
-
-IP0=172.30.30.1
-IP1=172.30.30.2
-DUMMYNET=10.9.9
-PFXL=30
 ret=0
+# UDP is slow
+timeout=15
 
 cleanup()
 {
-	ip netns pids $ns0 | xargs kill 2>/dev/null
-	ip netns pids $ns1 | xargs kill 2>/dev/null
+	ip netns pids "$ns1" | xargs kill 2>/dev/null
+	ip netns pids "$ns2" | xargs kill 2>/dev/null
+	ip netns pids "$ns3" | xargs kill 2>/dev/null
+	ip netns pids "$nsrouter" | xargs kill 2>/dev/null
 
 	cleanup_all_ns
 }
 
-checktool "nft --version" "run test without nft"
-checktool "conntrack --version" "run test without conntrack"
+checktool "nft --version" "test without nft tool"
 checktool "socat -h" "run test without socat"
 
 trap cleanup EXIT
+setup_ns ns1 ns2 ns3 nsrouter
 
-setup_ns ns0 ns1
-
-ip netns exec "$ns0" sysctl -q -w net.ipv4.conf.default.rp_filter=0
-ip netns exec "$ns0" sysctl -q -w net.ipv4.conf.all.rp_filter=0
-ip netns exec "$ns0" sysctl -q -w net.ipv4.conf.all.rp_filter=0
-ip netns exec "$ns0" sysctl -q -w net.ipv4.conf.all.forwarding=1
-
-if ! ip link add veth0 netns "$ns0" type veth peer name veth0 netns "$ns1" > /dev/null 2>&1; then
-	echo "SKIP: Could not add veth device"
-	exit $ksft_skip
+if ! ip link add veth0 netns "$nsrouter" type veth peer name eth0 netns "$ns1" > /dev/null 2>&1; then
+    echo "SKIP: No virtual ethernet pair device support in kernel"
+    exit $ksft_skip
 fi
+ip link add veth1 netns "$nsrouter" type veth peer name eth0 netns "$ns2"
+ip link add veth2 netns "$nsrouter" type veth peer name eth0 netns "$ns3"
+
+ip -net "$nsrouter" link set veth0 up
+ip -net "$nsrouter" addr add 10.0.1.1/24 dev veth0
+ip -net "$nsrouter" addr add dead:1::1/64 dev veth0 nodad
+
+ip -net "$nsrouter" link set veth1 up
+ip -net "$nsrouter" addr add 10.0.2.1/24 dev veth1
+ip -net "$nsrouter" addr add dead:2::1/64 dev veth1 nodad
+
+ip -net "$nsrouter" link set veth2 up
+ip -net "$nsrouter" addr add 10.0.3.1/24 dev veth2
+ip -net "$nsrouter" addr add dead:3::1/64 dev veth2 nodad
+
+ip -net "$ns1" link set eth0 up
+ip -net "$ns2" link set eth0 up
+ip -net "$ns3" link set eth0 up
+
+ip -net "$ns1" addr add 10.0.1.99/24 dev eth0
+ip -net "$ns1" addr add dead:1::99/64 dev eth0 nodad
+ip -net "$ns1" route add default via 10.0.1.1
+ip -net "$ns1" route add default via dead:1::1
+
+ip -net "$ns2" addr add 10.0.2.99/24 dev eth0
+ip -net "$ns2" addr add dead:2::99/64 dev eth0 nodad
+ip -net "$ns2" route add default via 10.0.2.1
+ip -net "$ns2" route add default via dead:2::1
+
+ip -net "$ns3" addr add 10.0.3.99/24 dev eth0
+ip -net "$ns3" addr add dead:3::99/64 dev eth0 nodad
+ip -net "$ns3" route add default via 10.0.3.1
+ip -net "$ns3" route add default via dead:3::1
+
+ip netns exec "$nsrouter" sysctl net.ipv6.conf.all.forwarding=1 > /dev/null
+ip netns exec "$nsrouter" sysctl net.ipv4.conf.veth0.forwarding=1 > /dev/null
+ip netns exec "$nsrouter" sysctl net.ipv4.conf.veth1.forwarding=1 > /dev/null
+ip netns exec "$nsrouter" sysctl net.ipv4.conf.veth2.forwarding=1 > /dev/null
+
+test_ping() {
+  if ! ip netns exec "$ns1" ping -c 1 -q 10.0.2.99 > /dev/null; then
+	return 1
+  fi
+
+  if ! ip netns exec "$ns1" ping -c 1 -q dead:2::99 > /dev/null; then
+	return 2
+  fi
+
+  if ! ip netns exec "$ns1" ping -c 1 -q 10.0.3.99 > /dev/null; then
+	return 1
+  fi
+
+  if ! ip netns exec "$ns1" ping -c 1 -q dead:3::99 > /dev/null; then
+	return 2
+  fi
+
+  return 0
+}
 
-if ! ip -net "$ns0" li add tvrf type vrf table 9876; then
-	echo "SKIP: Could not add vrf device"
-	exit $ksft_skip
-fi
+test_ping_router() {
+  if ! ip netns exec "$ns1" ping -c 1 -q 10.0.2.1 > /dev/null; then
+	return 3
+  fi
 
-ip -net "$ns0" link add dummy0 type dummy
+  if ! ip netns exec "$ns1" ping -c 1 -q dead:2::1 > /dev/null; then
+	return 4
+  fi
 
-ip -net "$ns0" li set veth0 master tvrf
-ip -net "$ns0" li set dummy0 master tvrf
-ip -net "$ns0" li set tvrf up
-ip -net "$ns0" li set veth0 up
-ip -net "$ns0" li set dummy0 up
-ip -net "$ns1" li set veth0 up
+  return 0
+}
 
-ip -net "$ns0" addr add $IP0/$PFXL dev veth0
-ip -net "$ns1" addr add $IP1/$PFXL dev veth0
-ip -net "$ns0" addr add $DUMMYNET.1/$PFXL dev dummy0
 
 listener_ready()
 {
-        local ns="$1"
-
-        ss -N "$ns" -l -n -t -o "sport = :55555" | grep -q "55555"
+	local ns="$1"
+	local port="$2"
+	local proto="$3"
+	ss -N "$ns" -ln "$proto" -o "sport = :$port" | grep -q "$port"
 }
 
-ip netns exec "$ns1" socat -u -4 TCP-LISTEN:55555,reuseaddr,fork STDOUT > /dev/null &
-busywait $BUSYWAIT_TIMEOUT listener_ready "$ns1"
-
-# test vrf ingress handling.
-# The incoming connection should be placed in conntrack zone 1,
-# as decided by the first iteration of the ruleset.
-test_ct_zone_in()
+test_conntrack_udp_expires()
 {
-ip netns exec "$ns0" nft -f - <<EOF
-table testct {
-	chain rawpre {
-		type filter hook prerouting priority raw;
-
-		iif { veth0, tvrf } counter meta nftrace set 1
-		iif veth0 counter ct zone set 1 counter return
-		iif tvrf counter ct zone set 2 counter return
-		ip protocol icmp counter
-		notrack counter
-	}
-
-	chain rawout {
-		type filter hook output priority raw;
-
-		oif veth0 counter ct zone set 1 counter return
-		oif tvrf counter ct zone set 2 counter return
-		notrack counter
+	local ip_proto="$1"
+	# derived variables
+	local testname="test_${ip_proto}_udp_forward"
+	local socat_ipproto
+	local vip
+	local ns2_ip
+	local ns3_ip
+	local ns2_ip_port
+	local ns3_ip_port
+
+	# socat 1.8.0 has a bug that requires to specify the IP family to bind (fixed in 1.8.0.1)
+	case $ip_proto in
+	"ip")
+		socat_ipproto="-4"
+		vip=10.0.4.10
+		ns2_ip=10.0.2.99
+		ns3_ip=10.0.3.99
+		vip_ip_port="$vip:8080"
+		ns2_ip_port="$ns2_ip:8080"
+		ns3_ip_port="$ns3_ip:8080"
+	;;
+	"ip6")
+		socat_ipproto="-6"
+		vip=dead:4::a
+		ns2_ip=dead:2::99
+		ns3_ip=dead:3::99
+		vip_ip_port="[$vip]:8080"
+		ns2_ip_port="[$ns2_ip]:8080"
+		ns3_ip_port="[$ns3_ip]:8080"
+	;;
+	*)
+	echo "FAIL: unsupported protocol"
+	exit 255
+	;;
+	esac
+
+	ip netns exec "$nsrouter" nft -f /dev/stdin <<EOF
+flush ruleset
+table inet filter {
+	chain divert {
+		type filter hook prerouting priority 0; policy accept;
+		$ip_proto daddr $vip udp dport 8080 dnat to $ns2_ip
 	}
 }
 EOF
-	ip netns exec "$ns1" ping -W 1 -c 1 -I veth0 "$IP0" > /dev/null
-
-	# should be in zone 1, not zone 2
-	count=$(ip netns exec "$ns0" conntrack -L -s $IP1 -d $IP0 -p icmp --zone 1 2>/dev/null | wc -l)
-	if [ "$count" -eq 1 ]; then
-		echo "PASS: entry found in conntrack zone 1"
-	else
-		echo "FAIL: entry not found in conntrack zone 1"
-		count=$(ip netns exec "$ns0" conntrack -L -s $IP1 -d $IP0 -p icmp --zone 2 2> /dev/null | wc -l)
-		if [ "$count" -eq 1 ]; then
-			echo "FAIL: entry found in zone 2 instead"
-		else
-			echo "FAIL: entry not in zone 1 or 2, dumping table"
-			ip netns exec "$ns0" conntrack -L
-			ip netns exec "$ns0" nft list ruleset
-		fi
-	fi
-}
-
-# add masq rule that gets evaluated w. outif set to vrf device.
-# This tests the first iteration of the packet through conntrack,
-# oifname is the vrf device.
-test_masquerade_vrf()
-{
-	local qdisc=$1
 
-	if [ "$qdisc" != "default" ]; then
-		tc -net "$ns0" qdisc add dev tvrf root "$qdisc"
-	fi
+	timeout "$timeout" ip netns exec "$ns2" socat "$socat_ipproto" udp-listen:8080,fork SYSTEM:"echo PONG_NS2" 2>/dev/null &
+	local server2_pid=$!
 
-	ip netns exec "$ns0" conntrack -F 2>/dev/null
+	timeout "$timeout" ip netns exec "$ns3" socat "$socat_ipproto" udp-listen:8080,fork SYSTEM:"echo PONG_NS3" 2>/dev/null &
+	local server3_pid=$!
 
-ip netns exec "$ns0" nft -f - <<EOF
-flush ruleset
-table ip nat {
-	chain rawout {
-		type filter hook output priority raw;
+	busywait "$BUSYWAIT_TIMEOUT" listener_ready "$ns2" 8080 "-u"
+	busywait "$BUSYWAIT_TIMEOUT" listener_ready "$ns3" 8080 "-u"
 
-		oif tvrf ct state untracked counter
-	}
-	chain postrouting2 {
-		type filter hook postrouting priority mangle;
-
-		oif tvrf ct state untracked counter
-	}
-	chain postrouting {
-		type nat hook postrouting priority 0;
-		# NB: masquerade should always be combined with 'oif(name) bla',
-		# lack of this is intentional here, we want to exercise double-snat.
-		ip saddr 172.30.30.0/30 counter masquerade random
-	}
-}
-EOF
-	if ! ip netns exec "$ns0" ip vrf exec tvrf socat -u -4 STDIN TCP:"$IP1":55555 < /dev/null > /dev/null;then
-		echo "FAIL: connect failure with masquerade + sport rewrite on vrf device"
-		ret=1
-		return
-	fi
-
-	# must also check that nat table was evaluated on second (lower device) iteration.
-	if ip netns exec "$ns0" nft list table ip nat |grep -q 'counter packets 1' &&
-	   ip netns exec "$ns0" nft list table ip nat |grep -q 'untracked counter packets [1-9]'; then
-		echo "PASS: connect with masquerade + sport rewrite on vrf device ($qdisc qdisc)"
+	local result
+	# request from ns1 to ns2 (direct traffic)
+	result=$(echo PING | ip netns exec "$ns1" socat -t 2 -T 2 STDIO udp:"$ns2_ip_port",sourceport=18888)
+	if [ "$result" == "PONG_NS2" ] ;then
+		echo "PASS: tproxy test $testname: ns1 got reply \"$result\" connecting to ns2"
 	else
-		echo "FAIL: vrf rules have unexpected counter value"
+		echo "ERROR: tproxy test $testname: ns1 got reply \"$result\" connecting to ns2, not \"${expect_ns1_ns2}\" as intended"
 		ret=1
 	fi
 
-	if [ "$qdisc" != "default" ]; then
-		tc -net "$ns0" qdisc del dev tvrf root
-	fi
-}
-
-# add masq rule that gets evaluated w. outif set to veth device.
-# This tests the 2nd iteration of the packet through conntrack,
-# oifname is the lower device (veth0 in this case).
-test_masquerade_veth()
-{
-	ip netns exec "$ns0" conntrack -F 2>/dev/null
-ip netns exec "$ns0" nft -f - <<EOF
-flush ruleset
-table ip nat {
-	chain postrouting {
-		type nat hook postrouting priority 0;
-		meta oif veth0 ip saddr 172.30.30.0/30 counter masquerade random
-	}
-}
-EOF
-	if ! ip netns exec "$ns0" ip vrf exec tvrf socat -u -4 STDIN TCP:"$IP1":55555 < /dev/null > /dev/null;then
-		echo "FAIL: connect failure with masquerade + sport rewrite on veth device"
+	# request from ns1 to ns3 (direct traffic)
+	result=$(echo PING | ip netns exec "$ns1" socat -t 2 -T 2 STDIO udp:"$ns3_ip_port")
+	if [ "$result" = "PONG_NS3" ] ;then
+		echo "PASS: tproxy test $testname: ns1 got reply \"$result\" connecting to ns3"
+	else
+		echo "ERROR: tproxy test $testname: ns1 got reply \"$result\" connecting to ns3, not \"$expect_ns1_ns3\" as intended"
 		ret=1
-		return
 	fi
 
-	# must also check that nat table was evaluated on second (lower device) iteration.
-	if ip netns exec "$ns0" nft list table ip nat |grep -q 'counter packets 1'; then
-		echo "PASS: connect with masquerade + sport rewrite on veth device"
+	# request from ns1 to vip (DNAT to ns2)
+	result=$(echo PING | ip netns exec "$ns1" socat -t 2 -T 2 STDIO udp:"$vip_ip_port")
+	if [ "$result" = "PONG_NS2" ] ;then
+		echo "PASS: tproxy test $testname: ns1 got reply \"$result\" connecting to ns3"
 	else
-		echo "FAIL: vrf masq rule has unexpected counter value"
+		echo "ERROR: tproxy test $testname: ns1 got reply \"$result\" connecting to ns3, not \"$expect_ns1_ns3\" as intended"
 		ret=1
 	fi
-}
 
-test_fib()
-{
-ip netns exec "$ns0" nft -f - <<EOF
+	# replace the DNAT rule to direct and replace ns2 destination with ns3
+		ip netns exec "$nsrouter" nft -f /dev/stdin <<EOF
 flush ruleset
-table ip t {
-	counter fibcount { }
-
-	chain prerouting {
-		type filter hook prerouting priority 0;
-		meta iifname veth0 ip daddr $DUMMYNET.2 fib daddr oif dummy0 counter name fibcount notrack
+table inet filter {
+	chain divert {
+		type filter hook prerouting priority 0; policy accept;
+		$ip_proto daddr $vip udp dport 8080 dnat to $ns3_ip
 	}
 }
 EOF
-	ip -net "$ns1" route add 10.9.9.0/24 via "$IP0" dev veth0
-	ip netns exec "$ns1" ping -q -w 1 -c 1 "$DUMMYNET".2 > /dev/null
-
-	if ip netns exec "$ns0" nft list counter t fibcount | grep -q "packets 1"; then
-		echo "PASS: fib lookup returned exepected output interface"
+	# request from ns1 to vip (DNAT to ns3)
+	# reuse the same port to validate the existing conntrack entry does not
+	# shadow the actual nftables rule.
+	result=$(echo PING | ip netns exec "$ns1" socat -t 2 -T 2 STDIO udp:"$vip_ip_port")
+	if [ "$result" = "PONG_NS3" ] ;then
+		echo "PASS: tproxy test $testname: ns1 got reply \"$result\" connecting to ns3"
 	else
-		echo "FAIL: fib lookup did not return exepected output interface"
+		echo "ERROR: tproxy test $testname: ns1 got reply \"$result\" connecting to ns3, not \"$expect_ns1_ns3\" as intended"
 		ret=1
-		return
 	fi
 }
 
-test_ct_zone_in
-test_masquerade_vrf "default"
-test_masquerade_vrf "pfifo"
-test_masquerade_veth
-test_fib
+
+if test_ping; then
+	# queue bypass works (rules were skipped, no listener)
+	echo "PASS: ${ns1} can reach ${ns2}"
+else
+	echo "FAIL: ${ns1} cannot reach ${ns2}: $ret" 1>&2
+	exit $ret
+fi
+
+test_conntrack_udp_expires "ip"
+test_conntrack_udp_expires "ip6"
 
 exit $ret

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-11-12 14:41                   ` Antonio Ojea
@ 2024-11-12 14:43                     ` Antonio Ojea
  2024-11-12 16:18                     ` Florian Westphal
  1 sibling, 0 replies; 23+ messages in thread
From: Antonio Ojea @ 2024-11-12 14:43 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter

On Tue, 12 Nov 2024 at 07:41, Antonio Ojea
<antonio.ojea.garcia@gmail.com> wrote:
>
> On Tue, 12 Nov 2024 at 02:20, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> >
> > On Tue, Nov 12, 2024 at 10:16:45AM +0100, Pablo Neira Ayuso wrote:
> > > I guess the concern is that assured flows cannot be expelled from the
> > > conntrack table via early_drop, that is why an expedite cleanup is
> > > important?
> >
> > Actually, the issue is that packets could end up in a backend which
> > does not exist after re-configuration, therefore, removing the entry
> > need to happen so ongoing flow have a chance to talk to another
> > (different) backend.
>
> Please take a look to this kselftest attached that emulates the
> problematic behavior in kubernetes,
>
> I think that in UDP the nat rule should take precedence over the
> conntrack entry,on the contrary to TCP where it is important to
> preserve the session if it has been established.
>
> I  did only a quick test and seems to fail also with Florian patch

please disregard the
a/tools/testing/selftests/net/netfilter/conntrack_vrf.sh file in the
patch, is a leftover of some local tests

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-11-12 14:41                   ` Antonio Ojea
  2024-11-12 14:43                     ` Antonio Ojea
@ 2024-11-12 16:18                     ` Florian Westphal
  2024-11-15  4:11                       ` Antonio Ojea
  1 sibling, 1 reply; 23+ messages in thread
From: Florian Westphal @ 2024-11-12 16:18 UTC (permalink / raw)
  To: Antonio Ojea; +Cc: Pablo Neira Ayuso, Florian Westphal, netfilter

Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
> On Tue, 12 Nov 2024 at 02:20, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> >
> > On Tue, Nov 12, 2024 at 10:16:45AM +0100, Pablo Neira Ayuso wrote:
> > > I guess the concern is that assured flows cannot be expelled from the
> > > conntrack table via early_drop, that is why an expedite cleanup is
> > > important?
> >
> > Actually, the issue is that packets could end up in a backend which
> > does not exist after re-configuration, therefore, removing the entry
> > need to happen so ongoing flow have a chance to talk to another
> > (different) backend.
> 
> Please take a look to this kselftest attached that emulates the
> problematic behavior in kubernetes,
> 
> I think that in UDP the nat rule should take precedence over the
> conntrack entry,on the contrary to TCP where it is important to
> preserve the session if it has been established.

Why? The peer is even alive in your test; from your initial description
I thought this was about 'peer stops responding, but udp conntrack
remains alive forever due to client-probes'.

This is just silly, we can't make a change to auto-toss all mappings
on a nat rule change.

What do you do when someone uses random sampling and refreshes the
mapping table?

Kernel doesn't know what kind of upper layer protocol is used, what
if its a stateful protocol that breaks when you packets get steered
somewhere else mid-stream?

Did you evaluate use of stateless NAT for your use case?  That would
follow rules 1:1 and thus break depending on the protocol expectations,
or not.

For insanity like this I think we really can't do anything except offer
an efficient conntrack table flush mechanism to avoid any loop in
userspace.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-11-12 16:18                     ` Florian Westphal
@ 2024-11-15  4:11                       ` Antonio Ojea
  2024-12-01 17:00                         ` Antonio Ojea
  0 siblings, 1 reply; 23+ messages in thread
From: Antonio Ojea @ 2024-11-15  4:11 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Pablo Neira Ayuso, netfilter

On Tue, 12 Nov 2024 at 09:18, Florian Westphal <fw@strlen.de> wrote:
>
> Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
> > On Tue, 12 Nov 2024 at 02:20, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > >
> > > On Tue, Nov 12, 2024 at 10:16:45AM +0100, Pablo Neira Ayuso wrote:
> > > > I guess the concern is that assured flows cannot be expelled from the
> > > > conntrack table via early_drop, that is why an expedite cleanup is
> > > > important?
> > >
> > > Actually, the issue is that packets could end up in a backend which
> > > does not exist after re-configuration, therefore, removing the entry
> > > need to happen so ongoing flow have a chance to talk to another
> > > (different) backend.
> >
> > Please take a look to this kselftest attached that emulates the
> > problematic behavior in kubernetes,
> >
> > I think that in UDP the nat rule should take precedence over the
> > conntrack entry,on the contrary to TCP where it is important to
> > preserve the session if it has been established.
>
> Why? The peer is even alive in your test; from your initial description
> I thought this was about 'peer stops responding, but udp conntrack
> remains alive forever due to client-probes'.
>
> This is just silly, we can't make a change to auto-toss all mappings
> on a nat rule change.
>
> What do you do when someone uses random sampling and refreshes the
> mapping table?
>
> Kernel doesn't know what kind of upper layer protocol is used, what
> if its a stateful protocol that breaks when you packets get steered
> somewhere else mid-stream?
>
> Did you evaluate use of stateless NAT for your use case?  That would
> follow rules 1:1 and thus break depending on the protocol expectations,
> or not.
>
> For insanity like this I think we really can't do anything except offer
> an efficient conntrack table flush mechanism to avoid any loop in
> userspace.

I recognize that I put a very extreme case and without the kubernetes
context is confusing, because the availability of the backends in
Kubernetes is represented by other high level APIs, my apologies.

Let's forget about this use case please, and let me try to redo the
test to represent the one that the backend stops replying, that is
still very useful.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Most optimal method to dump UDP conntrack entries
  2024-11-15  4:11                       ` Antonio Ojea
@ 2024-12-01 17:00                         ` Antonio Ojea
  0 siblings, 0 replies; 23+ messages in thread
From: Antonio Ojea @ 2024-12-01 17:00 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Pablo Neira Ayuso, netfilter

[-- Attachment #1: Type: text/plain, Size: 340 bytes --]

> Let's forget about this use case please, and let me try to redo the
> test to represent the one that the backend stops replying, that is
> still very useful.

Florian I modified the test (see file attached) and your patch is
working with it solving this problem.
I really think this is very useful, appreciate it if you can implement it.

[-- Attachment #2: 0001-selftest-netfilter-add-test-case-for-stale-udp-conne.patch --]
[-- Type: application/octet-stream, Size: 10477 bytes --]

From 1ede3b70e346649b916756f261f4f3fee450b162 Mon Sep 17 00:00:00 2001
From: Antonio Ojea <aojea@google.com>
Date: Tue, 12 Nov 2024 13:57:19 +0000
Subject: [PATCH] selftest: netfilter: add test case for stale udp connections

This commit adds a test case to verify that UDP conntrack entries
are not renewed by unidirectional trafic, to avoid cases when netfilter
DNAT rules are changed but the new packets keep matching the existing
conntrack rules that take precedence over existing netfilter rules.

The test case sets up specific DNAT rules and then establishes
a UDP connection. It then checks that packets reaches the expected
destination and changes the DNAT rule to reach a different destination.
Then it checks that after multiple requests, the sender is able to reach
the the new destination.

Based on test added by Florian Westphal in 37d220b.

Signed-off-by: Antonio Ojea <aojea@google.com>
---
 .../testing/selftests/net/netfilter/Makefile  |   1 +
 .../net/netfilter/conntrack_udp_expires.sh    | 262 ++++++++++++++++++
 2 files changed, 263 insertions(+)
 create mode 100755 tools/testing/selftests/net/netfilter/conntrack_udp_expires.sh

diff --git a/tools/testing/selftests/net/netfilter/Makefile b/tools/testing/selftests/net/netfilter/Makefile
index ffe161fac8b5..9ba5610f3ebc 100644
--- a/tools/testing/selftests/net/netfilter/Makefile
+++ b/tools/testing/selftests/net/netfilter/Makefile
@@ -12,6 +12,7 @@ TEST_PROGS += conntrack_dump_flush.sh
 TEST_PROGS += conntrack_icmp_related.sh
 TEST_PROGS += conntrack_ipip_mtu.sh
 TEST_PROGS += conntrack_tcp_unreplied.sh
+TEST_PROGS += conntrack_udp_expires.sh
 TEST_PROGS += conntrack_sctp_collision.sh
 TEST_PROGS += conntrack_vrf.sh
 TEST_PROGS += conntrack_reverse_clash.sh
diff --git a/tools/testing/selftests/net/netfilter/conntrack_udp_expires.sh b/tools/testing/selftests/net/netfilter/conntrack_udp_expires.sh
new file mode 100755
index 000000000000..7e2faaefe72a
--- /dev/null
+++ b/tools/testing/selftests/net/netfilter/conntrack_udp_expires.sh
@@ -0,0 +1,262 @@
+#!/bin/bash
+#
+# This tests conntrack on the following scenario:
+#
+#                         +------------+
+# +-------+               |  nsrouter  |                  +-------+
+# |ns1    |.99          .1|            |.1             .99|    ns2|
+# |   eth0|---------------|veth0  veth1|------------------|eth0   |
+# |       |  10.0.1.0/24  |            |   10.0.2.0/24    |       |
+# +-------+  dead:1::/64  |    veth2   |   dead:2::/64    +-------+
+#                         +------------+
+#                                |.1
+#                                |
+#                                |
+#                                |                        +-------+
+#                                |                     .99|    ns3|
+#                                +------------------------|eth0   |
+#                                       10.0.3.0/24       |       |
+#                                       dead:3::/64       +-------+
+#
+# nsrouters implement loadbalancing using DNAT with a virtual IP
+# 10.0.4.10 - dead:4::a
+# shellcheck disable=SC2162,SC2317
+
+source lib.sh
+ret=0
+# UDP is slow
+timeout=15
+
+cleanup()
+{
+	ip netns pids "$ns1" | xargs kill 2>/dev/null
+	ip netns pids "$ns2" | xargs kill 2>/dev/null
+	ip netns pids "$ns3" | xargs kill 2>/dev/null
+	ip netns pids "$nsrouter" | xargs kill 2>/dev/null
+
+	cleanup_all_ns
+}
+
+checktool "nft --version" "test without nft tool"
+checktool "socat -h" "run test without socat"
+
+trap cleanup EXIT
+setup_ns ns1 ns2 ns3 nsrouter
+
+if ! ip link add veth0 netns "$nsrouter" type veth peer name eth0 netns "$ns1" > /dev/null 2>&1; then
+    echo "SKIP: No virtual ethernet pair device support in kernel"
+    exit $ksft_skip
+fi
+ip link add veth1 netns "$nsrouter" type veth peer name eth0 netns "$ns2"
+ip link add veth2 netns "$nsrouter" type veth peer name eth0 netns "$ns3"
+
+ip -net "$nsrouter" link set veth0 up
+ip -net "$nsrouter" addr add 10.0.1.1/24 dev veth0
+ip -net "$nsrouter" addr add dead:1::1/64 dev veth0 nodad
+
+ip -net "$nsrouter" link set veth1 up
+ip -net "$nsrouter" addr add 10.0.2.1/24 dev veth1
+ip -net "$nsrouter" addr add dead:2::1/64 dev veth1 nodad
+
+ip -net "$nsrouter" link set veth2 up
+ip -net "$nsrouter" addr add 10.0.3.1/24 dev veth2
+ip -net "$nsrouter" addr add dead:3::1/64 dev veth2 nodad
+
+ip -net "$ns1" link set eth0 up
+ip -net "$ns2" link set eth0 up
+ip -net "$ns3" link set eth0 up
+
+ip -net "$ns1" addr add 10.0.1.99/24 dev eth0
+ip -net "$ns1" addr add dead:1::99/64 dev eth0 nodad
+ip -net "$ns1" route add default via 10.0.1.1
+ip -net "$ns1" route add default via dead:1::1
+
+ip -net "$ns2" addr add 10.0.2.99/24 dev eth0
+ip -net "$ns2" addr add dead:2::99/64 dev eth0 nodad
+ip -net "$ns2" route add default via 10.0.2.1
+ip -net "$ns2" route add default via dead:2::1
+
+ip -net "$ns3" addr add 10.0.3.99/24 dev eth0
+ip -net "$ns3" addr add dead:3::99/64 dev eth0 nodad
+ip -net "$ns3" route add default via 10.0.3.1
+ip -net "$ns3" route add default via dead:3::1
+
+ip netns exec "$nsrouter" sysctl net.ipv6.conf.all.forwarding=1 > /dev/null
+ip netns exec "$nsrouter" sysctl net.ipv4.conf.veth0.forwarding=1 > /dev/null
+ip netns exec "$nsrouter" sysctl net.ipv4.conf.veth1.forwarding=1 > /dev/null
+ip netns exec "$nsrouter" sysctl net.ipv4.conf.veth2.forwarding=1 > /dev/null
+
+test_ping() {
+  if ! ip netns exec "$ns1" ping -c 1 -q 10.0.2.99 > /dev/null; then
+	return 1
+  fi
+
+  if ! ip netns exec "$ns1" ping -c 1 -q dead:2::99 > /dev/null; then
+	return 2
+  fi
+
+  if ! ip netns exec "$ns1" ping -c 1 -q 10.0.3.99 > /dev/null; then
+	return 1
+  fi
+
+  if ! ip netns exec "$ns1" ping -c 1 -q dead:3::99 > /dev/null; then
+	return 2
+  fi
+
+  return 0
+}
+
+test_ping_router() {
+  if ! ip netns exec "$ns1" ping -c 1 -q 10.0.2.1 > /dev/null; then
+	return 3
+  fi
+
+  if ! ip netns exec "$ns1" ping -c 1 -q dead:2::1 > /dev/null; then
+	return 4
+  fi
+
+  return 0
+}
+
+
+listener_ready()
+{
+	local ns="$1"
+	local port="$2"
+	local proto="$3"
+	ss -N "$ns" -ln "$proto" -o "sport = :$port" | grep -q "$port"
+}
+
+test_conntrack_udp_expires()
+{
+	local ip_proto="$1"
+	# derived variables
+	local testname="test_${ip_proto}_udp_forward"
+	local socat_ipproto
+	local vip
+	local ns2_ip
+	local ns3_ip
+	local ns2_ip_port
+	local ns3_ip_port
+
+	# socat 1.8.0 has a bug that requires to specify the IP family to bind (fixed in 1.8.0.1)
+	case $ip_proto in
+	"ip")
+		socat_ipproto="-4"
+		vip=10.0.4.10
+		ns2_ip=10.0.2.99
+		ns3_ip=10.0.3.99
+		vip_ip_port="$vip:8080"
+		ns2_ip_port="$ns2_ip:8080"
+		ns3_ip_port="$ns3_ip:8080"
+	;;
+	"ip6")
+		socat_ipproto="-6"
+		vip=dead:4::a
+		ns2_ip=dead:2::99
+		ns3_ip=dead:3::99
+		vip_ip_port="[$vip]:8080"
+		ns2_ip_port="[$ns2_ip]:8080"
+		ns3_ip_port="[$ns3_ip]:8080"
+	;;
+	*)
+	echo "FAIL: unsupported protocol"
+	exit 255
+	;;
+	esac
+
+	ip netns exec "$nsrouter" nft -f /dev/stdin <<EOF
+flush ruleset
+table inet nat {
+	chain kube-proxy {
+		type nat hook prerouting priority 0; policy accept;
+		$ip_proto daddr $vip udp dport 8080 dnat to $ns2_ip_port
+	}
+}
+EOF
+	# Set low UDP timeouts to make the test faster since it has to wait for the conntrack entries to expire
+	ip netns exec "$nsrouter" bash -c 'printf 5 > /proc/sys/net/netfilter/nf_conntrack_udp_timeout'
+	ip netns exec "$nsrouter" bash -c 'printf 5 > /proc/sys/net/netfilter/nf_conntrack_udp_timeout_stream'
+
+	timeout "$timeout" ip netns exec "$ns2" socat "$socat_ipproto" udp-listen:8080,fork SYSTEM:"echo PONG_NS2" 2>/dev/null &
+	local server2_pid=$!
+
+	timeout "$timeout" ip netns exec "$ns3" socat "$socat_ipproto" udp-listen:8080,fork SYSTEM:"echo PONG_NS3" 2>/dev/null &
+	local server3_pid=$!
+
+	busywait "$BUSYWAIT_TIMEOUT" listener_ready "$ns2" 8080 "-u"
+	busywait "$BUSYWAIT_TIMEOUT" listener_ready "$ns3" 8080 "-u"
+
+	local result
+	# request from ns1 to ns2 (direct traffic)
+	result=$(echo PING | ip netns exec "$ns1" socat -t 2 -T 2 STDIO udp:"$ns2_ip_port",sourceport=18888)
+	if [ "$result" == "PONG_NS2" ] ;then
+		echo "PASS: $testname: ns1 got reply \"$result\" connecting to ns2"
+	else
+		echo "ERROR: $testname: ns1 got reply \"$result\" connecting to ns2, not \"PONG_NS2\" as intended"
+		ret=1
+	fi
+
+	# request from ns1 to ns3 (direct traffic)
+	result=$(echo PING | ip netns exec "$ns1" socat -t 2 -T 2 STDIO udp:"$ns3_ip_port",sourceport=18888)
+	if [ "$result" = "PONG_NS3" ] ;then
+		echo "PASS: $testname: ns1 got reply \"$result\" connecting to ns3"
+	else
+		echo "ERROR: $testname: ns1 got reply \"$result\" connecting to ns3, not \"PONG_NS3\" as intended"
+		ret=1
+	fi
+
+	# request from ns1 to vip (DNAT to ns2)
+	result=$(echo PING | ip netns exec "$ns1" socat -t 2 -T 2 STDIO udp:"$vip_ip_port",sourceport=18888)
+	if [ "$result" = "PONG_NS2" ] ;then
+		echo "PASS: $testname: ns1 got reply \"$result\" connecting to vip (ns2)"
+	else
+		echo "ERROR: $testname: ns1 got reply \"$result\" connecting to vip, not \"PONG_NS2\" as intended"
+		ret=1
+	fi
+
+	# kill the server listening in ns2
+	kill $server2_pid 2>/dev/null
+
+	# replace the DNAT rule to direct and replace ns2 destination with ns3
+	ip netns exec "$nsrouter" nft -f /dev/stdin <<EOF
+flush ruleset
+table inet nat {
+	chain kube-proxy {
+		type nat hook prerouting priority 0; policy accept;
+		$ip_proto daddr $vip udp dport 8080 dnat to $ns3_ip_port
+	}
+}
+EOF
+
+	# requests from ns1 to vip (DNAT to ns3) should fail but not renew the conntrack entry
+	# once the conntrack entry expires it should receive the response from the ns3
+	# we have to reuse the same port to hit the existing conntrack entry
+	for i in $(seq 1 20) ; do
+		result=$(echo PING | ip netns exec "$ns1" socat -t 2 -T 2 STDIO udp:"$vip_ip_port",sourceport=18888)
+		if [ "$result" = "PONG_NS3" ] ;then
+			echo "PASS: $testname: ns1 got reply \"$result\" connecting to vip (ns3)"
+			return
+		else
+			echo "LOG: $testname: ns1 got reply \"$result\" connecting to vip, retrying ..."
+		fi
+		sleep .5
+	done
+	# there was not answer from ns3
+	echo "ERROR: $testname: ns1 did not get reply connecting to vip after 20 attempts"
+	ret=1
+}
+
+
+if test_ping; then
+	# queue bypass works (rules were skipped, no listener)
+	echo "PASS: ${ns1} can reach ${ns2}"
+else
+	echo "FAIL: ${ns1} cannot reach ${ns2}: $ret" 1>&2
+	exit $ret
+fi
+
+test_conntrack_udp_expires "ip"
+test_conntrack_udp_expires "ip6"
+
+exit $ret
-- 
2.47.0.338.g60cca15819-goog


^ permalink raw reply related	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2024-12-01 17:01 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-17 10:26 Most optimal method to dump UDP conntrack entries Antonio Ojea
2024-10-17 12:46 ` Florian Westphal
2024-10-17 16:36   ` Pablo Neira Ayuso
2024-10-17 22:10     ` Antonio Ojea
2024-10-17 23:30       ` Florian Westphal
2024-10-18 11:05         ` Antonio Ojea
2024-10-18 11:33           ` Florian Westphal
2024-10-18 14:10             ` Antonio Ojea
2024-10-21 13:53               ` Florian Westphal
2024-10-23  9:03                 ` Benny Lyne Amorsen
2024-11-10 21:50                 ` Florian Westphal
2024-11-11  6:33                   ` Antonio Ojea
2024-11-11 12:06       ` Pablo Neira Ayuso
2024-11-11 12:09         ` Florian Westphal
2024-11-11 12:29           ` Pablo Neira Ayuso
2024-11-11 12:54             ` Florian Westphal
2024-11-12  9:16               ` Pablo Neira Ayuso
2024-11-12  9:20                 ` Pablo Neira Ayuso
2024-11-12 14:41                   ` Antonio Ojea
2024-11-12 14:43                     ` Antonio Ojea
2024-11-12 16:18                     ` Florian Westphal
2024-11-15  4:11                       ` Antonio Ojea
2024-12-01 17:00                         ` Antonio Ojea

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.