cgroups.vger.kernel.org archive mirror
From: Kuniyuki Iwashima <kuniyu@google.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: "David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	 Jakub Kicinski <kuba@kernel.org>,
	Neal Cardwell <ncardwell@google.com>,
	Paolo Abeni <pabeni@redhat.com>,
	 Willem de Bruijn <willemb@google.com>,
	Matthieu Baerts <matttbe@kernel.org>,
	 Mat Martineau <martineau@kernel.org>,
	Michal Hocko <mhocko@kernel.org>,
	 Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Simon Horman <horms@kernel.org>,
	 Geliang Tang <geliang@kernel.org>,
	Muchun Song <muchun.song@linux.dev>,
	 Kuniyuki Iwashima <kuni1840@gmail.com>,
	netdev@vger.kernel.org, mptcp@lists.linux.dev,
	 cgroups@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
Date: Mon, 28 Jul 2025 14:41:38 -0700
Message-ID: <CAAVpQUBYsRGkYsvf2JMTD+0t8OH41oZxmw46WTfPhEprTaS+Pw@mail.gmail.com>
In-Reply-To: <20250728160737.GE54289@cmpxchg.org>

On Mon, Jul 28, 2025 at 9:07 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote:
> > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > buffers and charge memory to per-protocol global counters pointed to by
> > sk->sk_proto->memory_allocated.
> >
> > When running under a non-root cgroup, this memory is also charged to the
> > memcg as sock in memory.stat.
> >
> > Even when memory usage is controlled by memcg, sockets using such protocols
> > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> >
> > This makes it difficult to accurately estimate and configure appropriate
> > global limits, especially in multi-tenant environments.
> >
> > If all workloads were guaranteed to be controlled under memcg, the issue
> > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> >
> > In reality, this assumption does not always hold, and a single workload
> > that opts out of memcg can consume memory up to the global limit,
> > becoming a noisy neighbour.
>
> Yes, an uncontrolled cgroup can consume all of a shared resource and
> thereby become a noisy neighbor. Why is network memory special?
>
> I assume you have some other mechanisms for curbing things like
> filesystem caches, anon memory, swap etc. of such otherwise
> uncontrolled groups, and this just happens to be your missing piece.

I think that's the tcp_mem[] knob, which limits TCP memory globally
and so covers the "uncontrolled" cgroup.  But we can't use it because
the "controlled" cgroup is also limited by this knob.
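
For readers less familiar with the knob: /proc/sys/net/ipv4/tcp_mem
holds three thresholds (min, pressure, max) counted in pages, and they
apply to TCP as a whole rather than per cgroup.  A minimal Python
sketch for turning them into bytes follows; parse_tcp_mem() and
read_tcp_mem() are hypothetical helper names, not an existing API.

```python
import os


def parse_tcp_mem(text, page_size):
    """Convert the three page counts in tcp_mem into byte thresholds."""
    low, pressure, high = (int(field) for field in text.split())
    return {
        "min": low * page_size,
        "pressure": pressure * page_size,
        "max": high * page_size,
    }


def read_tcp_mem(path="/proc/sys/net/ipv4/tcp_mem"):
    """Read the live sysctl (Linux only) and return thresholds in bytes."""
    with open(path) as f:
        return parse_tcp_mem(f.read(), os.sysconf("SC_PAGE_SIZE"))
```

Every TCP socket on the host is checked against these values, whether
or not its cgroup has its own memory limit.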

If we want the "controlled" cgroup to be governed by its memcg limit
alone, we must disable the global limit completely on the host,
meaning we lose the "missing piece".

Currently, there are only two poor choices:

1) Use tcp_mem[], but memory allocation could fail even if the
   cgroup has available memory

2) Disable tcp_mem[], but uncontrolled cgroups lose their seatbelt
   and can consume memory up to the system limit

but what we really need is

3) Uncontrolled cgroups are limited by tcp_mem[],
   AND
   for controlled cgroups, memory allocation won't fail as long as
   the cgroup has available memory, regardless of tcp_mem[]
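
To make 3) concrete, here is a toy Python model of the charge
decision.  It is only a sketch under stated assumptions, not the
kernel code: GlobalTcp, Memcg, charge() and the "isolated" flag are
hypothetical names for illustration, and the real logic in
__sk_mem_raise_allocated() is considerably more involved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GlobalTcp:
    """Stand-in for the per-protocol counter capped by tcp_mem[2] (pages)."""
    limit: int
    allocated: int = 0


@dataclass
class Memcg:
    """Stand-in for a cgroup; limit=None models memory.max == "max"."""
    limit: Optional[int]
    isolated: bool = False  # hypothetical opt-out of the global counter
    charged: int = 0


def charge(pages, memcg, tcp):
    """Return True if the allocation is allowed, updating the counters."""
    # The memcg limit, if one is set, always applies.
    if memcg.limit is not None and memcg.charged + pages > memcg.limit:
        return False
    # Only non-isolated cgroups are also checked against the global limit.
    if not memcg.isolated:
        if tcp.allocated + pages > tcp.limit:
            return False
        tcp.allocated += pages
    memcg.charged += pages
    return True


tcp = GlobalTcp(limit=100)
uncontrolled = Memcg(limit=None)
controlled = Memcg(limit=1000, isolated=True)

print(charge(100, uncontrolled, tcp))  # True: fills the global budget
print(charge(1, uncontrolled, tcp))    # False: stopped by the global limit
print(charge(500, controlled, tcp))    # True: only the memcg limit applies
```

In this model the uncontrolled cgroup keeps its seatbelt, while the
controlled one can only fail against its own limit.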


>
> But at this point, you're operating so far out of the cgroup resource
> management model that I don't think it can be reasonably supported.

I think this is actually operating within the normal cgroup
management model, relying on the memory limit configured for each
cgroup.

What's wrong here is that, to follow that model, we have to set
tcp_mem[] to UINT_MAX and give up the seatbelt for uncontrolled
cgroups.

But that is only because memory charged to a cgroup is also charged
to the global TCP counter, which should not be the case.


>
> I hate to say this, but can't you carry this out of tree until the
> transition is complete?
>
> I just don't think it makes any sense to have this as a permanent
> fixture in a general-purpose container management interface.

I understand that, and we should eventually replace the choice
between 1) and 2) with just 3), but introducing this change without
a knob will break assumptions in userspace and trigger regressions.

cgroup v2 is now widely enabled by major distros, and systemd
creates many processes under non-root cgroups but without
memory limits.

If we had no knob, such processes would suddenly lose the
tcp_mem[] seatbelt and could consume memory up to the system
limit.
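
As a rough way to see how many cgroups on a host fall into this
category, one can look for memory.max files that still read "max".
A minimal Python sketch, assuming cgroup v2 is mounted at
/sys/fs/cgroup; is_unlimited() and unlimited_cgroups() are
hypothetical helper names:

```python
from pathlib import Path


def is_unlimited(memory_max_text):
    """cgroup v2 reports the literal string "max" when no limit is set."""
    return memory_max_text.strip() == "max"


def unlimited_cgroups(root="/sys/fs/cgroup"):
    """List cgroup directories under root whose memory.max is unset."""
    found = []
    for path in Path(root).rglob("memory.max"):
        try:
            if is_unlimited(path.read_text()):
                found.append(str(path.parent))
        except OSError:
            continue  # cgroup vanished or is unreadable; skip it
    return sorted(found)
```

Note that this only inspects each cgroup's own memory.max; a cgroup
reading "max" may still be bounded by a limit set on an ancestor.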

How about announcing a deprecation plan for the knob with
pr_warn_once() or something, so that users have time to configure
the max properly?
