public inbox for netfilter@vger.kernel.org
* Confirming conntrack behavior on environments with multiple network namespaces
@ 2025-09-22  9:38 Antonio Ojea
  2025-09-23 17:07 ` Florian Westphal
  0 siblings, 1 reply; 6+ messages in thread
From: Antonio Ojea @ 2025-09-22  9:38 UTC (permalink / raw)
  To: netfilter

Hello Netfilter developers,

I've been debugging conntrack performance issues in Kubernetes
recently, and it would be great to confirm the understanding I have
built from my observations, testing, and the public information I
could find.

My assumptions are:

Global hash table: There is a single global hash table for the entire
system. Its size is governed by nf_conntrack_buckets, and since commit
3183ab89 ("netfilter: conntrack: allow increasing bucket size via
sysctl too"), it is possible to resize this table at runtime. The hash
key correctly includes the network namespace ID to differentiate per
namespace entries.
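For example, since that commit the table can be resized at runtime
from the root namespace (a sketch; the value is purely illustrative
and the commands require root):

```shell
# Inspect the current table size and entry count (the nf_conntrack
# module must be loaded for these files to exist).
cat /proc/sys/net/netfilter/nf_conntrack_buckets
cat /proc/sys/net/netfilter/nf_conntrack_count

# Grow the hash table at runtime; writable only from init_net.
# 262144 is an illustrative value, not a recommendation.
sysctl -w net.netfilter.nf_conntrack_buckets=262144
```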

Per-namespace limit: net.netfilter.nf_conntrack_max is a per-namespace
value (inherited from the init_net default) that limits the number of
entries that a single namespace is allowed to create.

Kubernetes "double-accounting": In Kubernetes, if a pod/container
enables conntrack internally (e.g., a service mesh sidecar or network
function), a single application connection will create two conntrack
entries in the global table: one for the pod's namespace and one for
the root namespace (for node-level routing/NAT).
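One way to observe the two entries from the node, assuming the
conntrack-tools package is installed (a sketch; POD_IP and POD_PID
are placeholders, and both commands require root):

```shell
# Entry created in the root namespace (post-routing/NAT view):
conntrack -L --src "$POD_IP"

# Entry created inside the pod's own namespace:
nsenter -t "$POD_PID" -n conntrack -L --src "$POD_IP"
```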

Monitoring the load factor: To monitor the system state (as suggested
by commit c77737b's focus on chain length), a key metric is the Load
Factor: global_count / nf_conntrack_buckets.
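Once both numbers are in hand, the ratio itself is trivial to compute;
a small helper sketch (the function name and interface are mine, not
an existing tool):

```shell
# load_factor COUNT BUCKETS -- print count/buckets to two decimals.
load_factor() {
  awk -v c="$1" -v b="$2" 'BEGIN { printf "%.2f\n", c / b }'
}

load_factor 131072 65536   # prints 2.00
```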

I was surprised to find that reading
/proc/sys/net/netfilter/nf_conntrack_count from the root namespace
(init_net) returns the global sum of all entries. Reading the same
file from inside a pod's namespace returns the per-namespace count.

Is this behavior correct? If so, it seems the simplest and most
accurate way to monitor the global load factor is something like:
watch -n 2 "awk 'NR==1 { count = \$1 }
                 NR==2 { buckets = \$1;
                         printf \"Connections: %-10d | Buckets: %-10d | Load Factor: %.2f\n\",
                                count, buckets, count / buckets }' \
    /proc/sys/net/netfilter/nf_conntrack_count \
    /proc/sys/net/netfilter/nf_conntrack_buckets"


Causes of bloat: One common cause of high entry counts appears to be
the very long default timeout for established connections
(nf_conntrack_tcp_timeout_established = 432000 seconds, i.e. 5 days).
However, most distributions seem to override this and ship a lower
value, such as one hour.

I'm also looking at nf_conntrack_tcp_timeout_time_wait (120s) vs. the
kernel's 60s TIME_WAIT, though I'm less clear on the implications of
tuning this.
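For reference, both timeouts can be inspected and lowered via sysctl;
a sketch, with the values purely illustrative and the writes requiring
root:

```shell
# Current values (in seconds).
sysctl net.netfilter.nf_conntrack_tcp_timeout_established
sysctl net.netfilter.nf_conntrack_tcp_timeout_time_wait

# Illustrative lower settings; validate against the workload
# before deploying anything like this.
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=3600
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=60
```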

Per-namespace statistics: Finally, is there a recommended,
programmatic way to get a breakdown of conntrack statistics per
namespace? I'm especially interested in the total number of entries
for each namespace. We know we can nsenter each network namespace and
read its local sysctl, but this is very inefficient. Is there a better
way to get this breakdown from the root namespace, perhaps via
netlink, procfs, or eBPF with something like bpftrace?
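For context, the inefficient approach we use today looks roughly like
the sketch below, assuming namespaces are mounted under /run/netns as
"ip netns" does (Kubernetes pods typically need "nsenter -t <pid> -n"
instead, since the kubelet does not mount them there):

```shell
# Print "<namespace> <conntrack count>" for every named netns.
# Produces no output if no namespaces are mounted in the directory.
per_ns_conntrack_count() {
  dir="${1:-/run/netns}"
  for ns in "$dir"/*; do
    [ -e "$ns" ] || continue
    count=$(ip netns exec "$(basename "$ns")" \
              cat /proc/sys/net/netfilter/nf_conntrack_count)
    printf '%s %s\n' "$(basename "$ns")" "$count"
  done
}
```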

Does this analysis seem correct to you? Any confirmation, advice or
correction would be incredibly helpful.

Thank you,
Antonio Ojea


Thread overview: 6+ messages
2025-09-22  9:38 Confirming conntrack behavior on environments with multiple network namespaces Antonio Ojea
2025-09-23 17:07 ` Florian Westphal
2025-09-26 13:10   ` Antonio Ojea
2025-09-26 14:03     ` Florian Westphal
2025-09-26 14:41       ` Antonio Ojea
2025-09-30 19:13         ` Florian Westphal
