cgroups.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Kuniyuki Iwashima <kuniyu@google.com>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: "Eric Dumazet" <edumazet@google.com>,
	"Michal Koutný" <mkoutny@suse.com>, "Tejun Heo" <tj@kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"Neal Cardwell" <ncardwell@google.com>,
	"Paolo Abeni" <pabeni@redhat.com>,
	"Willem de Bruijn" <willemb@google.com>,
	"Matthieu Baerts" <matttbe@kernel.org>,
	"Mat Martineau" <martineau@kernel.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Roman Gushchin" <roman.gushchin@linux.dev>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Simon Horman" <horms@kernel.org>,
	"Geliang Tang" <geliang@kernel.org>,
	"Muchun Song" <muchun.song@linux.dev>,
	"Kuniyuki Iwashima" <kuni1840@gmail.com>,
	netdev@vger.kernel.org, mptcp@lists.linux.dev,
	cgroups@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
Date: Wed, 23 Jul 2025 11:06:14 -0700	[thread overview]
Message-ID: <CAAVpQUDMj_1p6sVeo=bZ_u34HSX7V3WM6hYG3wHyyCACKrTKmQ@mail.gmail.com> (raw)
In-Reply-To: <e6qunyonbd4yxgf3g7gyc4435ueez6ledshde6lfdq7j5nslsh@xl7mcmaczfmk>

On Wed, Jul 23, 2025 at 10:28 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> Cc Tejun & Michal to get their opinion on memcg vs cgroup vs BPF
> options.
>
> On Tue, Jul 22, 2025 at 07:35:52PM -0700, Kuniyuki Iwashima wrote:
> [...]
> > >
> > > Running workloads in root cgroup is not normal and comes with a warning
> > > of no isolation provided.
> > >
> > > I looked at the patch again to understand the modes you are introducing.
> > > Initially, I thought the series introduced multiple modes, including an
> > > option to exclude network memory from memcg accounting. However, if I
> > > understand correctly, that is not the case—the opt-out applies only to
> > > the global TCP/UDP accounting. That’s a relief, and I apologize for the
> > > misunderstanding.
> > >
> > > If I’m correct, you need a way to exclude a workload from the global
> > > TCP/UDP accounting, and currently, memcg serves as a convenient
> > > abstraction for the workload. Please let me know if I misunderstood.
> >
> > Correct.
> >
> > Currently, memcg by itself cannot guarantee that memory allocation for
> > socket buffer does not fail even when memory.current < memory.max
> > due to the global protocol limits.
> >
> > It means we need to increase the global limits to
> >
> > (bytes of TCP socket buffer in each cgroup) * (number of cgroup)
> >
> > , which is hard to predict, and I guess that's the reason why you
> > or Wei set tcp_mem[] to UINT_MAX so that we can ignore the global
> > limit.
>
> No that was not the reason. The main reason behind max tcp_mem global
> limit was it was not needed

but the global limit did take place thus you had to set tcp_mem
to unlimited.

> as memcg should account and limit the
> network memory.
> I think the reason you don't want tcp_mem global limit
> unlimited now is

memcg has been subject to the global limit from day 0.

And note that not every process is under memcg with memory.max
configured.


> you have internal feature to let workloads opt out of
> the memcg accounting of network memory which is causing isolation
> issues.
>
> >
> > But we should keep tcp_mem[] within a sane range in the first place.
> >
> > This series allows us to configure memcg limits only and let memcg
> > guarantee no failure until it fully consumes memory.max.
> >
> > The point is that memcg should not be affected by the global limits,
> > and this is orthogonal with the assumption that every workload should
> > be running under memcg.
> >
> >
> > >
> > > Now memcg is one way to represent the workload. Another more natural, at
> > > least to me, is the core cgroup. Basically cgroup.something interface.
> > > BPF is yet another option.
> > >
> > > To me cgroup seems preferrable but let's see what other memcg & cgroup
> > > folks think. Also note that for cgroup and memcg the interface will need
> > > to be hierarchical.
> >
> > As the root cgroup doesn't have the knob, these combinations are
> > considered hierarchical:
> >
> > (parent, child) = (0, 0), (0, 1), (1, 1)
> >
> > and only the pattern below is not considered hierarchical
> >
> > (parent, child) = (1, 0)
> >
> > Let's say we lock the knob at the first socket creation like your
> > idea above.
> >
> > If a parent and its child' knobs are (0, 0) and the child creates a
> > socket, the child memcg is locked as 0.  When the parent enables
> > the knob, we must check all child cgroups as well.  Or, we lock
> > the all parents' knobs when a socket is created in a child cgroup
> > with knob=0 ?  In any cases we need a global lock.
> >
> > Well, I understand that the hierarchical semantics is preferable
> > for cgroup but I think it does not resolve any real issue and rather
> > churns the code unnecessarily.
>
> All this is implementation detail and I am asking about semantics. More
> specifically:
>
> 1. Will the root be non-isolated always?

Yes, because the root cgroup doesn't have memcg.
Also, the knob has CFTYPE_NOT_ON_ROOT.


> 2. If a cgroup is isolated, does it mean all its desendants are
>    isolated?

No, but this is because we MUST think about how we handle
the scenario above that (parent, child) = (0,0) becomes (1, 0).

We cannot think about the semantics without implementation
detail.  And if we allow such scenario, the hierarchical semantics
is fake and has no meaning.


> 3. Will there ever be a reasonable use-case where there is non-isolated
>    sub-tree under an isolated ancestor?

I think no, but again, we need to think about the scenario above,
otherwise, your ideal semantics is just broken.

Also, "no reasonable scenario" does not always mean "we must
prevent the scenario".

If there's nothing harmful, we can just let it be, especially if such
restriction gives nothing andrather hurts performance with no
good reason.


>
> Please give some thought to the above (and related) questions.

Please think about the implementation detail and if its trade-off
(just keeping semantics vs code churn & perf regression) makes
really sense.


>
> I am still not convinced that memcg is the right home for this opt-out
> feature. I have CCed cgroup folks to get their opinion as well.

  reply	other threads:[~2025-07-23 18:06 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-21 20:35 [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
2025-07-21 20:35 ` [PATCH v1 net-next 01/13] mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n Kuniyuki Iwashima
2025-07-22 14:30   ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 02/13] mptcp: Use tcp_under_memory_pressure() in mptcp_epollin_ready() Kuniyuki Iwashima
2025-07-22 14:33   ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 03/13] tcp: Simplify error path in inet_csk_accept() Kuniyuki Iwashima
2025-07-22 14:34   ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 04/13] net: Call trace_sock_exceed_buf_limit() for memcg failure with SK_MEM_RECV Kuniyuki Iwashima
2025-07-22 14:37   ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 05/13] net: Clean up __sk_mem_raise_allocated() Kuniyuki Iwashima
2025-07-22 14:38   ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 06/13] net-memcg: Introduce mem_cgroup_from_sk() Kuniyuki Iwashima
2025-07-22 14:39   ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 07/13] net-memcg: Introduce mem_cgroup_sk_enabled() Kuniyuki Iwashima
2025-07-22 14:40   ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 08/13] net-memcg: Pass struct sock to mem_cgroup_sk_(un)?charge() Kuniyuki Iwashima
2025-07-22 14:56   ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 09/13] net-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure() Kuniyuki Iwashima
2025-07-22 14:58   ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 10/13] net: Define sk_memcg under CONFIG_MEMCG Kuniyuki Iwashima
2025-07-22 14:58   ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 11/13] net-memcg: Add memory.socket_isolated knob Kuniyuki Iwashima
2025-07-22 15:00   ` Eric Dumazet
2025-07-31 13:39   ` Michal Koutný
2025-07-21 20:35 ` [PATCH v1 net-next 12/13] net-memcg: Store memcg->socket_isolated in sk->sk_memcg Kuniyuki Iwashima
2025-07-22 15:02   ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
2025-07-22 15:14   ` Shakeel Butt
2025-07-22 15:24     ` Eric Dumazet
2025-07-22 15:52       ` Shakeel Butt
2025-07-22 18:18         ` Kuniyuki Iwashima
2025-07-22 18:47           ` Shakeel Butt
2025-07-22 19:03             ` Kuniyuki Iwashima
2025-07-22 19:56               ` Shakeel Butt
2025-07-22 21:59                 ` Kuniyuki Iwashima
2025-07-23  0:29                   ` Shakeel Butt
2025-07-23  2:35                     ` Kuniyuki Iwashima
2025-07-23 17:28                       ` Shakeel Butt
2025-07-23 18:06                         ` Kuniyuki Iwashima [this message]
2025-07-25  1:49                           ` Jakub Kicinski
2025-07-25 18:50                             ` Kuniyuki Iwashima
2025-07-28 16:07   ` Johannes Weiner
2025-07-28 21:41     ` Kuniyuki Iwashima
2025-07-29 14:22       ` Johannes Weiner
2025-07-29 19:41         ` Kuniyuki Iwashima
2025-07-31  2:58   ` Roman Gushchin
2025-07-31 13:38   ` Michal Koutný
2025-07-31 23:51     ` Kuniyuki Iwashima
2025-08-01  7:00       ` Michal Koutný
2025-08-01 16:27         ` Kuniyuki Iwashima
2025-07-22 15:04 ` [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Shakeel Butt
2025-07-22 15:34   ` Eric Dumazet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAAVpQUDMj_1p6sVeo=bZ_u34HSX7V3WM6hYG3wHyyCACKrTKmQ@mail.gmail.com' \
    --to=kuniyu@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=geliang@kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=horms@kernel.org \
    --cc=kuba@kernel.org \
    --cc=kuni1840@gmail.com \
    --cc=linux-mm@kvack.org \
    --cc=martineau@kernel.org \
    --cc=matttbe@kernel.org \
    --cc=mhocko@kernel.org \
    --cc=mkoutny@suse.com \
    --cc=mptcp@lists.linux.dev \
    --cc=muchun.song@linux.dev \
    --cc=ncardwell@google.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=roman.gushchin@linux.dev \
    --cc=shakeel.butt@linux.dev \
    --cc=tj@kernel.org \
    --cc=willemb@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).