All of lore.kernel.org
 help / color / mirror / Atom feed
From: Florian Westphal <fw@strlen.de>
To: Antonio Ojea <antonio.ojea.garcia@gmail.com>
Cc: Florian Westphal <fw@strlen.de>,
	Pablo Neira Ayuso <pablo@netfilter.org>,
	netfilter@vger.kernel.org
Subject: Re: Most optimal method to dump UDP conntrack entries
Date: Fri, 18 Oct 2024 13:33:18 +0200	[thread overview]
Message-ID: <20241018113318.GA28324@breakpoint.cc> (raw)
In-Reply-To: <CABhP=tY=YnMPSMH0-T2oxHyE2Uzoy=RnHQmO4DRQ_Go8xzMGeg@mail.gmail.com>

Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
> > The UDP tracker is old and refreshs the timeout for every packet, but
> > who says it needs to stay that way?
> >
> > With very simple patch to only refresh in 'reply' dir:
> >
> >     [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > [DESTROY] udp      17 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> >     [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > [DESTROY] udp      17 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> >     [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> >
> > This is with a unidirectional UDP stream.
> >
> > diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c
> > --- a/net/netfilter/nf_conntrack_proto_udp.c
> > +++ b/net/netfilter/nf_conntrack_proto_udp.c
> > @@ -17,6 +17,7 @@
> >  #include <linux/netfilter.h>
> >  #include <linux/netfilter_ipv4.h>
> >  #include <linux/netfilter_ipv6.h>
> > +#include <net/netfilter/nf_conntrack_acct.h>
> >  #include <net/netfilter/nf_conntrack_l4proto.h>
> >  #include <net/netfilter/nf_conntrack_ecache.h>
> >  #include <net/netfilter/nf_conntrack_timeout.h>
> > @@ -80,6 +81,35 @@ static bool udp_error(struct sk_buff *skb,
> >         return false;
> >  }
> >
> > +static void nf_ct_refresh_udp(struct nf_conn *ct,
> > +                             enum ip_conntrack_dir dir,
> > +                             u32 len, u32 extra_jiffies)
> > +{
> > +       /* Only update if this is not a fixed timeout */
> > +       if (test_bit(IPS_FIXED_TIMEOUT_BIT, &ct->status))
> > +               return nf_ct_acct_update(ct, dir, len);
> > +
> > +       switch (dir) {
> > +       case IP_CT_DIR_ORIGINAL:
> > +               /* do not refresh in original dir, else stale dnat mapping
> > +                * can be kept alive indefinitely.
> > +                */
> > +               if (nf_ct_is_confirmed(ct))
> > +                       return nf_ct_acct_update(ct, dir, len);
> > +               break;
> > +       case IP_CT_DIR_REPLY:
> > +               extra_jiffies += nfct_time_stamp;
> > +               break;
> > +       case IP_CT_DIR_MAX: /* silence warning wrt. unhandled enum */
> > +               break;
> > +       }
> > +
> > +       if (READ_ONCE(ct->timeout) != extra_jiffies)
> > +               WRITE_ONCE(ct->timeout, extra_jiffies);
> > +
> > +       nf_ct_acct_update(ct, dir, len);
> > +}
> > +
> >  /* Returns verdict for packet, and may modify conntracktype */
> >  int nf_conntrack_udp_packet(struct nf_conn *ct,
> >                             struct sk_buff *skb,
> > @@ -114,7 +144,7 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
> >                         stream = (status & IPS_ASSURED) == 0;
> >                 }
> >
> > -               nf_ct_refresh_acct(ct, ctinfo, skb, extra);
> > +               nf_ct_refresh_udp(ct, CTINFO2DIR(ctinfo), skb->len, extra);
> >
> >                 /* never set ASSURED for IPS_NAT_CLASH, they time out soon */
> >                 if (unlikely((status & IPS_NAT_CLASH)))
> > @@ -124,7 +154,7 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
> >                 if (stream && !test_and_set_bit(IPS_ASSURED_BIT, &ct->status))
> >                         nf_conntrack_event_cache(IPCT_ASSURED, ct);
> >         } else {
> > -               nf_ct_refresh_acct(ct, ctinfo, skb, timeouts[UDP_CT_UNREPLIED]);
> > +               nf_ct_refresh_udp(ct, CTINFO2DIR(ctinfo), skb->len, timeouts[UDP_CT_UNREPLIED]);
> >         }
> >         return NF_ACCEPT;
> >  }
> >
> > The obvious counter-argument is "but what if its the reply side that is
> > sending, and we have a stale SNAT mapping?".
> >
> > Solution would be to track timestamp of last packet seen per direction,
> > and then stop refreshing the timeout if one side ceases sending.
> >
> > I think thats much better than what we have now and there would be no
> > need to actively zap UDP flows, they would timeout automatically once
> > one side ceases to transmit.
> 
> If I understand correctly this will have some visible effects, before
> this change the same tuple will only hit the same destination because
> the conntrack entry will prevail, and after this patch the packet will
> be spread out randomly to the number of backends.

Nothing changes unless flow is 100% uni-directional (no
return traffic OR only reply traffic after initial packet).

Old behavior:

Packet	 Direction
     1   Origin			# create entry with 30s timeout, save current time
     2   Reply			# flag entry assured (cannot be fast-removed
				# if conntrack table is full), refresh timeout to 30s
     3  Origin			# refresh timeout to 30s OR 2 Minutes, if current time is 2s in the past
     4  Origin			# refresh timeout to 30s OR 2 Minutes, if current time is 2s in the past
     ...
     5  Origin			# refresh timeout to 30s OR 2 Minutes, if current time is 2s in the past


Thus, once DNAT mapping changes, conntrack entry remains alive for as
long as origin sends more packets.

After change:

Packet	 Direction
     1   Origin			# create entry with 30s timeout, save current time
     2   Reply			# flag entry assured (cannot be fast-removed
				# if conntrack table is full), refresh timeout to 30s
     3  Origin			# do not refresh timeout
     4  Origin			# do not refresh timeout
     ...
     5  Origin			# do not refresh timeout

Once 30s have elapsed, this entry will expire, even if packets are seen.
Next packet will create a new conntrack, and a new NAT mapping could be
chosen.


If there is reply traffic, nothing changes:

Packet	 Direction
     1   Origin			# create entry with 30s timeout, save current time
     2   Reply			# flag entry assured (cannot be fast-removed
				# if conntrack table is full), refresh timeout to 30s
     3  Origin			# do not refresh timeout
     4  Reply			# refresh timeout to 30s OR 2 Minutes, if current time is 2s in the past
     ...
     5  Origin			# do not refresh timeout

Thats the general idea anyway.
I'll try to make this more robust and also let it timeout if we have
only reply traffic without any origin packets.

> I'm aware of some requests to have this behavior in kubernetes for UDP
> service, but I'm scared that there can be some people relying on this
> behavior ... and all the possible edge cases this will open, what
> happens if we have two UDP packets in one direction before the first
> reply is seen?

Same as what happens now, 2nd packet follows NAT mapping of first one.

  reply	other threads:[~2024-10-18 11:33 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-17 10:26 Most optimal method to dump UDP conntrack entries Antonio Ojea
2024-10-17 12:46 ` Florian Westphal
2024-10-17 16:36   ` Pablo Neira Ayuso
2024-10-17 22:10     ` Antonio Ojea
2024-10-17 23:30       ` Florian Westphal
2024-10-18 11:05         ` Antonio Ojea
2024-10-18 11:33           ` Florian Westphal [this message]
2024-10-18 14:10             ` Antonio Ojea
2024-10-21 13:53               ` Florian Westphal
2024-10-23  9:03                 ` Benny Lyne Amorsen
2024-11-10 21:50                 ` Florian Westphal
2024-11-11  6:33                   ` Antonio Ojea
2024-11-11 12:06       ` Pablo Neira Ayuso
2024-11-11 12:09         ` Florian Westphal
2024-11-11 12:29           ` Pablo Neira Ayuso
2024-11-11 12:54             ` Florian Westphal
2024-11-12  9:16               ` Pablo Neira Ayuso
2024-11-12  9:20                 ` Pablo Neira Ayuso
2024-11-12 14:41                   ` Antonio Ojea
2024-11-12 14:43                     ` Antonio Ojea
2024-11-12 16:18                     ` Florian Westphal
2024-11-15  4:11                       ` Antonio Ojea
2024-12-01 17:00                         ` Antonio Ojea

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20241018113318.GA28324@breakpoint.cc \
    --to=fw@strlen.de \
    --cc=antonio.ojea.garcia@gmail.com \
    --cc=netfilter@vger.kernel.org \
    --cc=pablo@netfilter.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.