Re: Confirming conntrack behavior on environments with multiple network namespaces

From: Florian Westphal <fw@strlen.de>
To: Antonio Ojea <antonio.ojea.garcia@gmail.com>
Cc: netfilter@vger.kernel.org
Subject: Re: Confirming conntrack behavior on environments with multiple network namespaces
Date: Tue, 23 Sep 2025 19:07:01 +0200
Message-ID: <aNLTtcdGNJT8FuYk@strlen.de>
In-Reply-To: <CABhP=tYo-R_vFJhfUGj_GmW5mCAFR3VNEkkhJDp7pHpHNYvD5g@mail.gmail.com>

Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
> My assumptions are:
> 
> Global hash table: There is a single global hash table for the entire
> system. Its size is governed by nf_conntrack_buckets, and since commit
> 3183ab89 ("netfilter: conntrack: allow increasing bucket size via
> sysctl too"), it is possible to resize this table at runtime. The hash
> key correctly includes the network namespace ID to differentiate per
> namespace entries.

Yes.

> Per-namespace limit: net.netfilter.nf_conntrack_max is a per-namespace
> value (inherited from the init_net default) that limits the number of
> entries that a single namespace is allowed to create.

Yes. There is a patch rotting in the backlog that allows a non-init netns
to shrink it, so that the lower value is used.  As-is, only init_net can
change this.

I hope I can get to this patch by mid-October.

> Kubernetes "double-accounting": In Kubernetes, if a pod/container
> enables conntrack internally (e.g., a service mesh sidecar or network
> function), a single application connection will create two conntrack
> entries in the global table: one for the pod's namespace and one for
> the root namespace (for node-level routing/NAT).

2 conntrack entries -> 4 hash entries (each conntrack entry is hashed
twice: once for the original direction, once for the reply direction).
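
As a worked example of that arithmetic, here is a minimal shell sketch.
The function name hash_entries is made up for illustration; it only
assumes that every conntrack entry is linked into the hash table once
per direction (original and reply).

```shell
#!/bin/sh
# Hypothetical helper: number of hash-table entries occupied by a given
# number of conntrack entries. Each conntrack entry is hashed once for
# the original tuple and once for the reply tuple.
hash_entries() {
    conntracks=$1
    echo $((conntracks * 2))
}

# One application connection tracked in both the pod netns and init_net
# gives 2 conntrack entries, so this prints 4.
hash_entries 2
```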

> Monitoring the load factor: To monitor the system state (as suggested
> by commit c77737b's focus on chain length), a key metric is the Load
> Factor: global_count / nf_conntrack_buckets.

Yes.

> I was surprised to find that reading
> /proc/sys/net/netfilter/nf_conntrack_count from the root namespace
> (init_net) returns the global sum of all entries. Reading the same
> file from inside a pod's namespace returns the per-namespace count.

No, it's always per namespace, even in init_net.
Try not enabling conntrack in init_net.

Or add 'notrack' rules in init_net for the traffic originating from the
netns.  Or create a netns, enable conntrack and then only create
connections to addresses reachable via loopback.  In all these cases the
init_net count won't be affected by the other net namespace.
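
For the 'notrack' variant, a hedged sketch of what such rules could look
like with nftables. The table name pod_notrack and the veth* interface
match are assumptions for illustration (they presume the netns is attached
to init_net through veth devices); adjust them to the actual topology.
The function only prints the ruleset so it can be reviewed first.

```shell
#!/bin/sh
# Print an nftables ruleset that skips connection tracking in init_net
# for packets arriving from netns-facing interfaces.
# Assumptions (not from this thread): the table name "pod_notrack", and
# that netns traffic enters init_net through interfaces named veth*.
print_notrack_ruleset() {
    cat <<'EOF'
table inet pod_notrack {
    chain prerouting {
        type filter hook prerouting priority raw; policy accept;
        iifname "veth*" notrack
    }
}
EOF
}

# Usage (as root, in init_net):
#   print_notrack_ruleset | nft -f -
```

Note that this only covers the originating direction. Reply packets that
arrive on other interfaces would need their own matching rule, and NAT in
init_net depends on conntrack, so this fits only setups where init_net
does not have to NAT that traffic.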

> Is this behavior correct? It seems the simplest and most accurate
> way to monitor the global Load Factor is something like
> 
> watch -n 2 "awk 'NR==1{count=\$1} NR==2{buckets=\$1; printf
> \"Connections: %-10d | Buckets: %-10d | Load Factor: %.2f\n\", count,
> buckets, count/buckets}' /proc/sys/net/netfilter/nf_conntrack_count
> /proc/sys/net/netfilter/nf_conntrack_buckets"

That will only monitor that namespace; I'm not aware of a global counter.

> Causes of bloat: One of the common causes of high connection counts
> seems to be the very long default timeout for established connections
> (nf_conntrack_tcp_timeout_established = 5 days). However, it seems
> most distros override the default and set it to a lower number by
> default, like one hour.

5 days is the historic value inherited from ip_conntrack.
I don't know of a good default value to use instead.

Changing it to 1h seems too short for a default value; changing it to
4 days looks pointless to me.

> Per-namespace statistics: Finally, is there a recommended,
> programmatic way to get a breakdown of conntrack statistics per
> namespace? I'm especially interested in the total number of entries
> for each namespace. We know we can nsenter each network namespace and
> read its local sysctl, but this is very inefficient. Is there a better
> way to get this breakdown from the root namespace, perhaps via
> netlink, procfs, or eBPF with something like bpftrace?

Not at this time.  There is ctnetlink but it requires the nsenter
dance.
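
For reference, a rough sketch of that nsenter dance using the
per-namespace sysctl mentioned in the question. The function names are
made up for illustration. It assumes root privileges and that every
namespace of interest has at least one process in it; namespaces kept
alive only by a bind mount are missed.

```shell
#!/bin/sh
# Visit each distinct network namespace once and read its per-namespace
# conntrack count.

# Read "netns-id pid" pairs on stdin; print the first pid seen per id.
first_pid_per_netns() {
    awk '!seen[$1]++ { print $2 }'
}

# Emit "netns-id pid" for every process whose ns/net link is readable.
# readlink yields an identifier such as net:[4026531840].
list_netns_pids() {
    for ns in /proc/[0-9]*/ns/net; do
        id=$(readlink "$ns" 2>/dev/null) || continue
        pid=${ns#/proc/}
        pid=${pid%%/*}
        echo "$id $pid"
    done
}

# Print "pid count" for one representative pid per namespace.
# Namespaces where the sysctl cannot be read are skipped.
conntrack_counts() {
    list_netns_pids | first_pid_per_netns | while read -r pid; do
        count=$(nsenter -t "$pid" -n \
            cat /proc/sys/net/netfilter/nf_conntrack_count 2>/dev/null) || continue
        echo "$pid $count"
    done
}

# Usage (as root):
#   conntrack_counts
```

This still forks one nsenter per namespace, so it does not remove the
inefficiency raised in the question; it only visits each namespace once
instead of once per process.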

