Re: Most optimal method to dump UDP conntrack entries


From: Pablo Neira Ayuso <pablo@netfilter.org>
To: Antonio Ojea <antonio.ojea.garcia@gmail.com>
Cc: Florian Westphal <fw@strlen.de>, netfilter@vger.kernel.org
Subject: Re: Most optimal method to dump UDP conntrack entries
Date: Mon, 11 Nov 2024 13:06:38 +0100
Message-ID: <ZzHzTvYFOPcfWvRs@calendula>
In-Reply-To: <CABhP=ta5FesHP+xnPq7UaiNvvorfGgPJZV9Xxdvk_-SLk7ZxEg@mail.gmail.com>

Hi Antonio,

On Thu, Oct 17, 2024 at 11:10:02PM +0100, Antonio Ojea wrote:
> On Thu, 17 Oct 2024 at 17:36, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> >
> > On Thu, Oct 17, 2024 at 02:46:32PM +0200, Florian Westphal wrote:
> > > Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
> > > > In the context of Kubernetes, when DNATing entries for UDP Services,
> > > > we need to deal with some edge cases where some UDP entries are left
> > > > orphaned but blackhole the traffic to the new endpoints.
> > > >
> > > > At high level, the scenario is:
> > > > - Client IP_A sends UDP traffic to VirtualIP IP_B --> Kubernetes
> > > > Translates this to Endpoint IP_C
> > > > - Endpoint IP_C is replaced by Endpoint IP_D, but since Client IP_A
> > > > does not stop sending traffic, the conntrack entry IP_A IP_B --> IP_C
> > > > takes precedence and is being renewed, so traffic is not sent to the
> > > > new Endpoint IP_D and is lost.
> > > >
> > > > To solve this problem, we have some heuristics to detect those
> > > > scenarios when the endpoints change and flush the conntrack entries,
> > > > however, since this is event based, if we lost the event that
> > > > triggered the problem or something happens that fails to clean up the
> > > > entry, the user needs to manually flush the entries.

You can still stick to the event approach, then resort to a
resync/reconcile loop when userspace gets a report that events are
getting lost, i.e. a hybrid approach.
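
A minimal sketch of that hybrid loop, assuming the event listener
surfaces lost events as an OSError with errno ENOBUFS (which is how a
netlink socket reports a receive buffer overrun). The callables
recv_event, handle_event and resync are hypothetical placeholders for
the real listener, the per-event cleanup and the full dump/reconcile
pass; they are not part of any existing library:

```python
import errno


def run_hybrid(recv_event, handle_event, resync):
    """Handle events one by one; fall back to a full resync on event loss.

    recv_event   -- hypothetical callable returning the next event, None
                    when the stream ends, or raising OSError(ENOBUFS)
                    when the kernel reports that events were dropped
    handle_event -- hypothetical callable that processes a single event
    resync       -- hypothetical callable that dumps and reconciles state

    Returns the number of resyncs that were triggered.
    """
    resyncs = 0
    while True:
        try:
            event = recv_event()
        except OSError as exc:
            if exc.errno != errno.ENOBUFS:
                raise  # unrelated socket error, let the caller decide
            # Events were lost: incremental state can no longer be trusted.
            resync()
            resyncs += 1
            continue
        if event is None:
            return resyncs
        handle_event(event)
```

The full dump then only runs when the kernel says events were dropped,
instead of on every rule update.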

> > > > We are implementing a new approach to solve this, we list all the UDP
> > > > conntrack entries using netlink, compare against the existing
> > > > programmed nftables/iptables rules, and flush the ones we know are
> > > > stale.
> > > >
> > > > During the implementation review, the question [1] this raises is, how
> > > > impactful is it to dump all the conntrack entries each time we program
> > > > the iptables/nftables rules (this can be every 1s on nodes with a lot
> > > > of entries)?
> > > > Is this approach completely safe?
> > > > Should we try to read from procfs instead?
> > >
> > > Walking all conntrack entries in 1s intervals is going to be slow, no
> > > matter the chosen interface.  Even doing the filtering in the kernel to
> > > not dump all entries but only those that match udp/port/ip criteria is
> > > not going to change it.
> 
> We are not worried about being slow in the order of seconds, the
> system is eventually consistent so there can always be a reasonable
> latency.
> Since we only care about UDP, losing packets during that period is not
> desirable but is assumable.
> My main concern is if constantly dumping all the entries via netlink
> can cause any issue or increase resources consumption.
>
> > >
> > > Also both proc and netlink dumps can miss entries (albeit it's rare),
> > > if parallel insertions/deletes happen (which is normal on busy system).
> > >
> 
> That is one of the reasons we want to implement this reconcile loop,
> so it can be resilient to this kind of errors, we keep the state on
> the API in the control plane, so we can always rebuild the state in
> the dataplane (recreating nftables rules and deleting conntrack entries
> that do not match the current state)
> 
> > > I wonder why the appropriate delete requests cannot be done when the
> > > mapping is altered, I mean, you must have some code that issues
> > > either iptables -t nat -D ... or nft delete element ... or similar.
> > >
> > > If you do that, why not also fire off the conntrack -D request
> > > afterwards?  Or are these publish/withdraw so frequent that this
> > > doesn't matter compared to poll based approach?
> > >
> > > Something like
> > >    conntrack -D --protonum 17 --orig-dst $vserver --orig-port-dst 53 --reply-src $rserver --reply-port-src 5353
> > >
> > > would zap everything to $rserver mapped to $vserver from client point of view.
> >
> 
> This is how it is implemented today and it works, but it does not
> handle process restarts, for example, and is not resilient to errors.
> The implementation is also much more complex because we need to
> implement all the possible edge cases that can leave stale entries

It should also be possible to shrink timeouts on restart via conntrack -U,
which would be similar to the approach that Florian is proposing, but driven
from the control plane rather than by updating the existing UDP timeout policy.
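
As a rough illustration, a control plane could build such an update
command on restart along these lines. This is only a sketch: it reuses
the filter options from Florian's conntrack -D example and assumes
that conntrack -U accepts --timeout together with them. The helper
name and its parameters are made up for this example:

```python
def shrink_timeout_argv(vserver, rserver, timeout_s):
    """Build an argv that lowers the timeout of matching UDP entries.

    vserver   -- virtual (service) IP as seen by the client
    rserver   -- real endpoint IP the flow was DNATed to
    timeout_s -- new timeout in seconds

    Assumption: `conntrack -U` accepts --timeout with these filters.
    """
    return [
        "conntrack", "-U",
        "--protonum", "17",          # UDP
        "--orig-dst", vserver,
        "--reply-src", rserver,
        "--timeout", str(timeout_s),
    ]
```

The resulting list can be handed to subprocess.run() by a process that
has the privileges to modify conntrack entries.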

