Date: Tue, 23 Sep 2025 19:07:01 +0200 (CEST)
From: Florian Westphal
To: Antonio Ojea
Cc: netfilter@vger.kernel.org
Subject: Re: Confirming conntrack behavior on environments with multiple network namespaces
Antonio Ojea wrote:
> My assumptions are:
>
> Global hash table: There is a single global hash table for the entire
> system. Its size is governed by nf_conntrack_buckets, and since commit
> 3183ab89 ("netfilter: conntrack: allow increasing bucket size via
> sysctl too"), it is possible to resize this table at runtime. The hash
> key correctly includes the network namespace ID to differentiate
> per-namespace entries.

Yes.

> Per-namespace limit: net.netfilter.nf_conntrack_max is a per-namespace
> value (inherited from the init_net default) that limits the number of
> entries that a single namespace is allowed to create.

Yes. There is a patch rotting in the backlog that allows a non-init
netns to shrink it, so the lower value is used. As-is, only init_net
can change this. I hope I can get to this patch by mid-October.

> Kubernetes "double-accounting": In Kubernetes, if a pod/container
> enables conntrack internally (e.g., a service mesh sidecar or network
> function), a single application connection will create two conntrack
> entries in the global table: one for the pod's namespace and one for
> the root namespace (for node-level routing/NAT).

2 conntrack entries -> 4 hash entries (origin, reply).

> Monitoring the load factor: To monitor the system state (as suggested
> by commit c77737b's focus on chain length), a key metric is the load
> factor: global_count / nf_conntrack_buckets.

Yes.

> I was surprised to find that reading
> /proc/sys/net/netfilter/nf_conntrack_count from the root namespace
> (init_net) returns the global sum of all entries. Reading the same
> file from inside a pod's namespace returns the per-namespace count.

No, it's always per-namespace, even in init_net. Try not enabling
conntrack in init_net. Or add 'notrack' rules in init_net for the
netns-originating traffic. Or create a netns, enable conntrack and then
only create connections to addresses reachable via loopback. In all
these cases init_net won't be affected by the other net namespace.

> Is this behavior correct? It seems the simplest and most accurate way
> to monitor the global load factor is doing something like
>
> watch -n 2 "awk 'NR==1{count=\$1} NR==2{buckets=\$1; printf
> \"Connections: %-10d | Buckets: %-10d | Load Factor: %.2f\n\", count,
> buckets, count/buckets}' /proc/sys/net/netfilter/nf_conntrack_count
> /proc/sys/net/netfilter/nf_conntrack_buckets"

That will only monitor that namespace; I'm not aware of a global
counter.

> Causes of bloat: One of the common causes of high connection counts
> seems to be the very long default timeout for established connections
> (nf_conntrack_tcp_timeout_established = 5 days). However, it seems
> most distros override the default and set it to a lower number, like
> one hour.

5 days is the historic value inherited from ip_conntrack. I don't know
of a good default value to use instead. Changing it to 1h seems too
short for a default value; changing it to 4 days looks pointless to me.

> Per-namespace statistics: Finally, is there a recommended,
> programmatic way to get a breakdown of conntrack statistics per
> namespace? I'm especially interested in the total number of entries
> for each namespace. We know we can nsenter each network namespace and
> read its local sysctl, but this is very inefficient. Is there a better
> way to get this breakdown from the root namespace, perhaps via
> netlink, procfs, or eBPF with something like bpftrace?

Not at this time. There is ctnetlink, but it requires the nsenter
dance.
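
[Since nf_conntrack_count is per-namespace and there is no global
counter, the "nsenter dance" has to be scripted. A rough sketch,
assuming namespaces managed by iproute2 (visible under
/var/run/netns) and the conntrack sysctls present in each netns;
Kubernetes pod namespaces are often *not* registered there, so those
would need nsenter against /proc/<pid>/ns/net instead:]

```shell
#!/bin/sh
# Sum nf_conntrack_count across init_net and all named namespaces,
# then divide by the bucket count to approximate a global load factor.
# Note: nf_conntrack_buckets is shared (single global hash table),
# while nf_conntrack_count is per-namespace.
total=$(cat /proc/sys/net/netfilter/nf_conntrack_count)
printf 'init_net: %s\n' "$total"
for ns in $(ip netns list | awk '{print $1}'); do
    c=$(ip netns exec "$ns" cat /proc/sys/net/netfilter/nf_conntrack_count)
    printf '%s: %s\n' "$ns" "$c"
    total=$((total + c))
done
buckets=$(cat /proc/sys/net/netfilter/nf_conntrack_buckets)
awk -v t="$total" -v b="$buckets" \
    'BEGIN { printf "total: %d  load factor: %.2f\n", t, t/b }'
```

[Needs root, and only covers namespaces that iproute2 knows about;
anonymous namespaces held open by container runtimes are invisible to
'ip netns list'.]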