* Confirming conntrack behavior on environments with multiple network namespaces
@ 2025-09-22 9:38 Antonio Ojea
2025-09-23 17:07 ` Florian Westphal
0 siblings, 1 reply; 6+ messages in thread
From: Antonio Ojea @ 2025-09-22 9:38 UTC (permalink / raw)
To: netfilter
Hello Netfilter developers,
I've been debugging conntrack performance issues in Kubernetes
recently, and it would be great to confirm my understanding based on
my observations, testing, and the public information I could find.
My assumptions are:
Global hash table: There is a single global hash table for the entire
system. Its size is governed by nf_conntrack_buckets, and since commit
3183ab89 ("netfilter: conntrack: allow increasing bucket size via
sysctl too"), it is possible to resize this table at runtime. The hash
key correctly includes the network namespace ID to differentiate
per-namespace entries.
Per-namespace limit: net.netfilter.nf_conntrack_max is a per-namespace
value (inherited from the init_net default) that limits the number of
entries that a single namespace is allowed to create.
Kubernetes "double-accounting": In Kubernetes, if a pod/container
enables conntrack internally (e.g., a service mesh sidecar or network
function), a single application connection will create two conntrack
entries in the global table: one for the pod's namespace and one for
the root namespace (for node-level routing/NAT).
Monitoring the load factor: To monitor the system state (as suggested
by commit c77737b's focus on chain length), a key metric is the Load
Factor: global_count / nf_conntrack_buckets.
I was surprised to find that reading
/proc/sys/net/netfilter/nf_conntrack_count from the root namespace
(init_net) returns the global sum of all entries. Reading the same
file from inside a pod's namespace returns the per-namespace count.
Is this behavior correct? It seems the simplest and most accurate way
to monitor the global Load Factor is something like
watch -n 2 "awk 'NR==1{count=\$1} NR==2{buckets=\$1; printf
\"Connections: %-10d | Buckets: %-10d | Load Factor: %.2f\n\", count,
buckets, count/buckets}' /proc/sys/net/netfilter/nf_conntrack_count
/proc/sys/net/netfilter/nf_conntrack_buckets"
Causes of bloat: One of the common causes of high connection counts
seems to be the very long default timeout for established connections
(nf_conntrack_tcp_timeout_established = 5 days). However, most
distros seem to override this and set a lower value, such as one hour.
I'm also looking at nf_conntrack_tcp_timeout_time_wait (120s) vs. the
kernel's 60s TIME_WAIT, though I'm less clear on the implications of
tuning this.
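To make the distro-style override concrete, it would be a sysctl
drop-in along these lines (the file name and the values here are
illustrative, not recommended defaults):

```
# /etc/sysctl.d/90-conntrack.conf (illustrative name and values,
# not recommended defaults)
net.netfilter.nf_conntrack_tcp_timeout_established = 3600
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
```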
Per-namespace statistics: Finally, is there a recommended,
programmatic way to get a breakdown of conntrack statistics per
namespace? I'm especially interested in the total number of entries
for each namespace. We know we can nsenter each network namespace and
read its local sysctl, but this is very inefficient. Is there a better
way to get this breakdown from the root namespace, perhaps via
netlink, procfs, or eBPF with something like bpftrace?
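For concreteness, the inefficient per-namespace loop we do today looks
something like this (assuming the namespaces are named and visible to
`ip netns`; Kubernetes pods may only expose theirs via
/proc/<pid>/ns/net):

```shell
#!/bin/sh
# Sketch: per-netns conntrack counts via the "nsenter dance".
# Assumes named namespaces under /var/run/netns (e.g. `ip netns add`).
for ns in $(ip netns list | awk '{print $1}'); do
    count=$(ip netns exec "$ns" cat /proc/sys/net/netfilter/nf_conntrack_count)
    printf '%-24s %s\n' "$ns" "$count"
done
```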
Does this analysis seem correct to you? Any confirmation, advice or
correction would be incredibly helpful.
Thank you,
Antonio Ojea
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Confirming conntrack behavior on environments with multiple network namespaces
2025-09-22 9:38 Confirming conntrack behavior on environments with multiple network namespaces Antonio Ojea
@ 2025-09-23 17:07 ` Florian Westphal
2025-09-26 13:10 ` Antonio Ojea
0 siblings, 1 reply; 6+ messages in thread
From: Florian Westphal @ 2025-09-23 17:07 UTC (permalink / raw)
To: Antonio Ojea; +Cc: netfilter
Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
> My assumptions are:
>
> Global hash table: There is a single global hash table for the entire
> system. Its size is governed by nf_conntrack_buckets, and since commit
> 3183ab89 ("netfilter: conntrack: allow increasing bucket size via
> sysctl too"), it is possible to resize this table at runtime. The hash
> key correctly includes the network namespace ID to differentiate
> per-namespace entries.
Yes.
> Per-namespace limit: net.netfilter.nf_conntrack_max is a per-namespace
> value (inherited from the init_net default) that limits the number of
> entries that a single namespace is allowed to create.
Yes. There is a patch rotting in the backlog that allows a non-init
netns to shrink it, so the lower value is used. As-is, only init_net
can change this.
I hope I can get to this patch by mid-October.
> Kubernetes "double-accounting": In Kubernetes, if a pod/container
> enables conntrack internally (e.g., a service mesh sidecar or network
> function), a single application connection will create two conntrack
> entries in the global table: one for the pod's namespace and one for
> the root namespace (for node-level routing/NAT).
2 conntrack entries -> 4 hash entries (origin, reply).
> Monitoring the load factor: To monitor the system state (as suggested
> by commit c77737b's focus on chain length), a key metric is the Load
> Factor: global_count / nf_conntrack_buckets.
Yes.
> I was surprised to find that reading
> /proc/sys/net/netfilter/nf_conntrack_count from the root namespace
> (init_net) returns the global sum of all entries. Reading the same
> file from inside a pod's namespace returns the per-namespace count.
No, it's always per-namespace, even in init_net.
Try not enabling conntrack in init_net.
Or add 'notrack' rules in init_net for the netns originating traffic.
Or create a netns, enable conntrack and then only create connections
to addresses reachable via loopback. In all these cases init_net won't
be affected by the net namespace.
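A notrack rule of that kind would look roughly like this nft ruleset
(the veth interface pattern is only an example of how to match the
netns-originated traffic):

```
table ip raw {
    chain prerouting {
        type filter hook prerouting priority raw; policy accept;
        # example match: skip conntrack for traffic arriving from pod veths
        iifname "veth*" notrack
    }
}
```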
> Is this behavior correct? It seems the simplest and most accurate way
> to monitor the global Load Factor is something like
>
> watch -n 2 "awk 'NR==1{count=\$1} NR==2{buckets=\$1; printf
> \"Connections: %-10d | Buckets: %-10d | Load Factor: %.2f\n\", count,
> buckets, count/buckets}' /proc/sys/net/netfilter/nf_conntrack_count
> /proc/sys/net/netfilter/nf_conntrack_buckets"
That will only monitor that namespace, I'm not aware of a global counter.
> Causes of bloat: One of the common causes of high connection counts
> seems to be the very long default timeout for established connections
> (nf_conntrack_tcp_timeout_established = 5 days). However, most
> distros seem to override this and set a lower value, such as one hour.
5 days is the historic value inherited from ip_conntrack.
I don't know of a good default value to use instead.
Changing it to 1h seems too short for a default value; changing it to
4 days looks pointless to me.
> Per-namespace statistics: Finally, is there a recommended,
> programmatic way to get a breakdown of conntrack statistics per
> namespace? I'm especially interested in the total number of entries
> for each namespace. We know we can nsenter each network namespace and
> read its local sysctl, but this is very inefficient. Is there a better
> way to get this breakdown from the root namespace, perhaps via
> netlink, procfs, or eBPF with something like bpftrace?
Not at this time. There is ctnetlink but it requires the nsenter
dance.
* Re: Confirming conntrack behavior on environments with multiple network namespaces
2025-09-23 17:07 ` Florian Westphal
@ 2025-09-26 13:10 ` Antonio Ojea
2025-09-26 14:03 ` Florian Westphal
0 siblings, 1 reply; 6+ messages in thread
From: Antonio Ojea @ 2025-09-26 13:10 UTC (permalink / raw)
To: Florian Westphal; +Cc: netfilter
On Tue, 23 Sept 2025 at 19:07, Florian Westphal <fw@strlen.de> wrote:
>
> Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
>
> > I was surprised to find that reading
> > /proc/sys/net/netfilter/nf_conntrack_count from the root namespace
> > (init_net) returns the global sum of all entries. Reading the same
> > file from inside a pod's namespace returns the per-namespace count.
>
> No, it's always per-namespace, even in init_net.
Heh, my bad, I had a bug in the script I used for testing, and this
makes more sense now. Thanks for clarifying.
> Try not enabling conntrack in init_net.
> Or add 'notrack' rules in init_net for the netns originating traffic.
> Or create a netns, enable conntrack and then only create connections
> to addresses reachable via loopback. In all these cases init_net won't
> be affected by the net namespace.
It will be hard to disable conntrack in init_net; a lot of
functionality in Kubernetes relies on NAT and is not owned by the same
component.
>
> That will only monitor that namespace, I'm not aware of a global counter.
>
Can this be a feature request? A global counter would make it easy to
monitor the load factor of the whole system, but I'm not familiar with
kernel conventions, and a global counter may be discouraged.
* Re: Confirming conntrack behavior on environments with multiple network namespaces
2025-09-26 13:10 ` Antonio Ojea
@ 2025-09-26 14:03 ` Florian Westphal
2025-09-26 14:41 ` Antonio Ojea
0 siblings, 1 reply; 6+ messages in thread
From: Florian Westphal @ 2025-09-26 14:03 UTC (permalink / raw)
To: Antonio Ojea; +Cc: netfilter
Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
> > Or add 'notrack' rules in init_net for the netns originating traffic.
> > Or create a netns, enable conntrack and then only create connections
> > to addresses reachable via loopback. In all these cases init_net won't
> > be affected by the net namespace.
>
> It will be hard to disable conntrack in init_net; a lot of
> functionality in Kubernetes relies on NAT and is not owned by the same
> component.
Sure, I understand that no-conntrack-in-init-but-in-another-namespace
is atypical.
> > That will only monitor that namespace, I'm not aware of a global counter.
> >
>
> Can this be a feature request? A global counter would make it easy to
> monitor the load factor of the whole system, but I'm not familiar with
> kernel conventions, and a global counter may be discouraged.
Not sure we should add one, it will result in some overhead for no good
reason. Maybe /proc/slabinfo is enough for your use case?
nf_conntrack has its own (global) memory pool, it should provide a
reasonably good estimate across all netns.
* Re: Confirming conntrack behavior on environments with multiple network namespaces
2025-09-26 14:03 ` Florian Westphal
@ 2025-09-26 14:41 ` Antonio Ojea
2025-09-30 19:13 ` Florian Westphal
0 siblings, 1 reply; 6+ messages in thread
From: Antonio Ojea @ 2025-09-26 14:41 UTC (permalink / raw)
To: Florian Westphal; +Cc: netfilter
On Fri, 26 Sept 2025 at 16:03, Florian Westphal <fw@strlen.de> wrote:
> Not sure we should add one, it will result in some overhead for no good
> reason. Maybe /proc/slabinfo is enough for your use case?
>
> nf_conntrack has its own (global) memory pool, it should provide a
> reasonably good estimate across all netns.
Interesting. Just to confirm: can I approximate the load factor using
active_objects / buckets, with
active_objects=$(sudo awk '$1 == "nf_conntrack" {print $2}' /proc/slabinfo)
buckets=$(cat /proc/sys/net/netfilter/nf_conntrack_buckets)
That sounds really good.
* Re: Confirming conntrack behavior on environments with multiple network namespaces
2025-09-26 14:41 ` Antonio Ojea
@ 2025-09-30 19:13 ` Florian Westphal
0 siblings, 0 replies; 6+ messages in thread
From: Florian Westphal @ 2025-09-30 19:13 UTC (permalink / raw)
To: Antonio Ojea; +Cc: netfilter
Antonio Ojea <antonio.ojea.garcia@gmail.com> wrote:
> Interesting. Just to confirm: can I approximate the load factor using
> active_objects / buckets, with
>
> active_objects=$(sudo awk '$1 == "nf_conntrack" {print $2}' /proc/slabinfo)
> buckets=$(cat /proc/sys/net/netfilter/nf_conntrack_buckets)
Yes, but keep in mind that each entry is hashed twice, once for original
direction and once for reply.
The active_objects count will be higher than what ends up in the table
(e.g., if the first packet of a connection is dropped, it is never inserted).
But it should be a good estimate.
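Putting those caveats together, the estimate from the earlier messages
can be sketched as follows (reading /proc/slabinfo typically needs
root, and active_objs over-counts the hashed entries as noted):

```shell
#!/bin/sh
# Sketch: approximate the global conntrack load factor from the
# nf_conntrack slab. active_objs is column 2 of /proc/slabinfo.
active=$(awk '$1 == "nf_conntrack" {print $2}' /proc/slabinfo)
buckets=$(cat /proc/sys/net/netfilter/nf_conntrack_buckets)
awk -v a="$active" -v b="$buckets" 'BEGIN { printf "load factor: %.2f\n", a/b }'
```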
end of thread
Thread overview: 6+ messages
2025-09-22 9:38 Confirming conntrack behavior on environments with multiple network namespaces Antonio Ojea
2025-09-23 17:07 ` Florian Westphal
2025-09-26 13:10 ` Antonio Ojea
2025-09-26 14:03 ` Florian Westphal
2025-09-26 14:41 ` Antonio Ojea
2025-09-30 19:13 ` Florian Westphal