Linux Netfilter discussions
From: Florian Westphal <fw@strlen.de>
To: Antonio Ojea <antonio.ojea.garcia@gmail.com>
Cc: netfilter@vger.kernel.org
Subject: Re: Most optimal method to dump UDP conntrack entries
Date: Thu, 17 Oct 2024 14:46:32 +0200
Message-ID: <20241017124632.GC12005@breakpoint.cc>
In-Reply-To: <CABhP=tZe2MDYJWqmyhb=we2g_v11wuDb6yBp878vt7qQYrvUiA@mail.gmail.com>

Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
> In the context of Kubernetes, when DNATing entries for UDP Services,
> we need to deal with some edge cases where some UDP entries are left
> orphaned but blackhole the traffic to the new endpoints.
> 
> At high level, the scenario is:
> - Client IP_A sends UDP traffic to VirtualIP IP_B --> Kubernetes
> Translates this to Endpoint IP_C
> - Endpoint IP_C is replaced by Endpoint IP_D, but since Client IP_A
> does not stop sending traffic, the conntrack entry IP_A IP_B --> IP_C
> takes precedence and is being renewed, so traffic is not sent to the
> new Endpoint IP_D and is lost.
> 
> To solve this problem, we have some heuristics to detect those
> scenarios when the endpoints change and flush the conntrack entries,
> however, since this is event based, if we lose the event that
> triggered the problem or something happens that fails to clean up the
> entry, the user needs to manually flush the entries.
> 
> We are implementing a new approach to solve this, we list all the UDP
> conntrack entries using netlink, compare against the existing
> programmed nftables/iptables rules, and flush the ones we know are
> stale.
> 
> During the implementation review, the question [1] this raises is, how
> impactful is it to dump all the conntrack entries each time we program
> the iptables/nftables rules (this can be every 1s on nodes with a lot
> of entries)?
> Is this approach completely safe?
> Should we try to read from procfs instead?

Walking all conntrack entries in 1s intervals is going to be slow, no
matter the chosen interface.  Even doing the filtering in the kernel to
not dump all entries but only those that match udp/port/ip criteria is
not going to change it.

Also, both proc and netlink dumps can miss entries (albeit rarely)
if parallel insertions/deletions happen (which is normal on a busy
system).

I wonder why the appropriate delete requests cannot be done when the
mapping is altered, I mean, you must have some code that issues
either iptables -t nat -D ... or nft delete element ... or similar.

If you do that, why not also fire off the conntrack -D request
afterwards?  Or are these publish/withdraw events so frequent that
this doesn't matter compared to a poll-based approach?

Something like
   conntrack -D --protonum 17 --orig-dst $vserver --orig-port-dst 53 --reply-src $rserver --reply-port-src 5353

would zap everything to $rserver mapped to $vserver from the client's
point of view.
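
A rough sketch of how that could be wrapped so it runs right after the
rule or set element is removed.  The function name flush_udp_mapping,
the DRY_RUN switch and the addresses in the usage line are made up for
illustration; only the conntrack options are taken from the command
above.

```shell
#!/bin/sh
# Hypothetical helper: delete UDP conntrack entries that still point
# at a withdrawn backend.  Intended to be called right after the
# matching DNAT rule or set element has been deleted.
#
# usage: flush_udp_mapping VSERVER VPORT RSERVER RPORT
# With DRY_RUN=1 the command is printed instead of executed.
flush_udp_mapping() {
    if [ "$#" -ne 4 ]; then
        echo "usage: flush_udp_mapping VSERVER VPORT RSERVER RPORT" >&2
        return 2
    fi
    # Replace the positional parameters with the full command line.
    set -- conntrack -D --protonum 17 \
        --orig-dst "$1" --orig-port-dst "$2" \
        --reply-src "$3" --reply-port-src "$4"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
        return 0
    fi
    "$@"
}
```

For example, "DRY_RUN=1 flush_udp_mapping 10.96.0.10 53 10.244.1.5 5353"
only prints the command, which is handy for checking what would be
deleted before wiring it into the rule-update path.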

Granted, this isn't great either, but you would not have to poll
all the time?  Or are updates

Is this only a problem for UDP?  I wonder if we should change UDP
conntrack to no longer refresh the timestamp for the original direction
if the connection is subject to NAT; that would make such entries expire
in the given 'dnat mapping went away and client tries to talk to
unreachable host' scenario.
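
To make the difference concrete, here is a toy model of the two refresh
policies.  It is not kernel code: the function name entry_expiry, the
"any"/"reply" policy labels, the 30 second timeout and the packet times
are all assumptions chosen for illustration.

```shell
#!/bin/sh
# Toy model, not kernel code: print the time at which a UDP conntrack
# entry created at time 0 would expire.
#
# usage: entry_expiry TIMEOUT POLICY EVENT...
#   TIMEOUT  idle timeout in seconds
#   POLICY   "any"   = a packet in either direction refreshes the entry
#            "reply" = only reply-direction packets refresh it
#                      (the idea floated above for NATed flows)
#   EVENT    TIME:DIR with DIR "orig" or "reply", sorted by TIME
entry_expiry() {
    timeout=$1
    policy=$2
    shift 2
    expires=$timeout
    for ev in "$@"; do
        t=${ev%%:*}
        dir=${ev##*:}
        # A packet arriving at or after expiry cannot refresh the entry.
        if [ "$t" -ge "$expires" ]; then
            break
        fi
        if [ "$policy" = any ] || [ "$dir" = reply ]; then
            expires=$((t + timeout))
        fi
    done
    echo "$expires"
}
```

With a client that keeps sending every 10 seconds and a backend that
never answers, "entry_expiry 30 any 10:orig 20:orig 30:orig 40:orig
50:orig 60:orig" prints 90 (and would keep growing with more packets),
while the same events with policy "reply" print 30: the stale entry
times out, and a later packet would presumably be tracked as a new flow
and go through the current NAT rules.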
