From: Kuniyuki Iwashima <kuniyu@google.com>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Eric Dumazet <edumazet@google.com>,
"David S. Miller" <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>,
Neal Cardwell <ncardwell@google.com>,
Paolo Abeni <pabeni@redhat.com>,
Willem de Bruijn <willemb@google.com>,
Matthieu Baerts <matttbe@kernel.org>,
Mat Martineau <martineau@kernel.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
Simon Horman <horms@kernel.org>,
Geliang Tang <geliang@kernel.org>,
Muchun Song <muchun.song@linux.dev>,
Kuniyuki Iwashima <kuni1840@gmail.com>,
netdev@vger.kernel.org, mptcp@lists.linux.dev,
cgroups@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
Date: Tue, 22 Jul 2025 14:59:33 -0700
Message-ID: <CAAVpQUAL09OGKZmf3HkjqqkknaytQ59EXozAVqJuwOZZucLR0Q@mail.gmail.com>
In-Reply-To: <xjtbk6g2a3x26sqqrdxbm2vxgxmm3nfaryxlxwipwohsscg7qg@64ueif57zont>
On Tue, Jul 22, 2025 at 12:56 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Jul 22, 2025 at 12:03:48PM -0700, Kuniyuki Iwashima wrote:
> > On Tue, Jul 22, 2025 at 11:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Tue, Jul 22, 2025 at 11:18:40AM -0700, Kuniyuki Iwashima wrote:
> > > > >
> > > > > I expect this state of jobs with different network accounting config
> > > > > > running concurrently is temporary while the migration from one to the other
> > > > > is happening. Please correct me if I am wrong.
> > > >
> > > > We need to migrate workload gradually and the system-wide config
> > > > does not work at all. AFAIU, there are already years of effort spent
> > > > on the migration but it's not yet completed at Google. So, I don't think
> > > > the need is temporary.
> > > >
> > >
> > > From what I remembered shared borg had completely moved to memcg
> > > accounting of network memory (with sys container as an exception) years
> > > ago. Did something change there?
> >
> > AFAICS, there are some workloads that opted out from memcg and
> > consumed too much tcp memory due to tcp_mem=UINT_MAX, triggering
> > OOM and disrupting other workloads.
> >
>
> What were the reasons behind opting out? We should fix those
> instead of a permanent opt-out option.
>
> > >
> > > > >
> > > > > My main concern with the memcg knob is that it is permanent and it
> > > > > requires a hierarchical semantics. No need to add a permanent interface
> > > > > for a temporary need and I don't see a clear hierarchical semantic for
> > > > > this interface.
> > > >
> > > > I don't see merits of having hierarchical semantics for this knob.
> > > > Regardless of this knob, hierarchical semantics is guaranteed
> > > > by other knobs. I think such semantics for this knob just complicates
> > > > the code with no gain.
> > > >
> > >
> > > Cgroup interfaces are hierarchical and we want to keep it that way.
> > > Putting non-hierarchical interfaces just makes configuration and setup
> > > hard to reason about.
> >
> > Actually, I tried that way in the initial draft version, but even if the
> > parent's knob is 1 and child one is 0, a harmful scenario didn't come
> > to my mind.
> >
>
> It is not just about harmful scenario but more about clear semantics.
> Check memory.zswap.writeback semantics.

zswap checks all parent cgroups when evaluating the knob, but
this is not an option for the networking fast path: we cannot
walk the ancestors for every skb without degrading performance.

Also, we don't track which sockets were created with the knob
enabled or how many such sockets are still left under the cgroup,
so there is no way to keep the option consistent throughout the
hierarchy, and no need to try hard to make it pretend to be
consistent if there's no real issue.

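To make the cost argument concrete, here is a minimal userspace sketch,
not kernel code and not taken from the patch. It contrasts a zswap-style
check that walks every ancestor with a flag snapshotted once at socket
creation. All names (struct cgroup_model, struct sock_model,
knob_hierarchical(), sock_model_init(), sock_isolated_fast()) are
hypothetical, and the "any ancestor with 0 wins" rule is only an assumed
stand-in for whatever hierarchical semantics would be chosen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a cgroup node; not the kernel's struct. */
struct cgroup_model {
	struct cgroup_model *parent;	/* NULL for the root */
	bool isolated;			/* models the per-memcg knob */
};

/* Hypothetical socket model that snapshots the knob at creation time. */
struct sock_model {
	struct cgroup_model *cg;
	bool isolated_cached;
};

/*
 * Hierarchical evaluation (assumed semantics): the effective value is
 * true only if the knob is set at every level, so each call walks up
 * to the root. Cost is O(depth) per check.
 */
bool knob_hierarchical(const struct cgroup_model *cg)
{
	for (; cg; cg = cg->parent)
		if (!cg->isolated)
			return false;
	return true;
}

/* Snapshot the cgroup's own knob once, when the socket is created. */
void sock_model_init(struct sock_model *sk, struct cgroup_model *cg)
{
	assert(cg != NULL);
	sk->cg = cg;
	sk->isolated_cached = cg->isolated;
}

/* Fast-path check: a single load from the socket, no ancestor walk. */
bool sock_isolated_fast(const struct sock_model *sk)
{
	return sk->isolated_cached;
}
```

In this model a socket keeps the value it was created with, even if the
knob is flipped later. That is the same reason the option cannot be made
consistent across the hierarchy without tracking existing sockets.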
>
> >
> > >
> > > >
> > > > >
> > > > > > I am wondering if alternative approaches for per-workload settings have been
> > > > > > explored, starting with BPF.
> > > > >
> > >
> > > Any response on the above? Any alternative approaches explored?
> >
> > Do you mean flagging each socket by BPF at cgroup hook ?
>
> Not sure. Will it not be very similar to your current approach? Each
> socket is associated with a memcg and at the place where you need to
> check which accounting method to use, just check that memcg setting in
> bpf and you can cache the result in socket as well.

The socket pointer is not writable from BPF by default, so we would
need to add a bpf helper or kfunc just to flip a single bit. As said,
this is overkill, and a per-memcg knob is much simpler.

>
> >
> > I think it's overkill and we don't need such finer granularity.
> >
> > Also it sounds way too hacky to use BPF to correct the weird
> > behaviour from day0.
>
> What weird behavior? Two accounting mechanisms. Yes I agree but memcgs
> with different accounting mechanisms concurrently is also weird.

Not that weird, given that sockets in the root cgroup have no
sk->sk_memcg and are subject to the global TCP memory accounting.
We already have a mixed set of memcgs.

Also, not every cgroup sets memory limits. systemd puts some
processes into a non-root cgroup by default without setting memory.max.
In such a case we definitely want the global memory accounting to take
place.

Having to set memory.max on every non-root cgroup is less flexible
and too restrictive.
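
For illustration, the mixed set described above can be modelled in a few
lines of userspace C. This is a sketch of my reading of the thread, not
the patch: a socket without a memcg (root cgroup) is charged to the
global per-protocol counter only, a socket with a memcg is charged to
both, and a socket whose memcg has the proposed knob set is charged to
the memcg only. The types and charge_model() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a memcg; only the fields the sketch needs. */
struct memcg_model {
	long charged_pages;
	bool isolated;		/* models the proposed per-memcg knob */
};

/* Hypothetical model of a socket; memcg is NULL for root-cgroup sockets. */
struct sk_model {
	struct memcg_model *memcg;
};

/* Models the global per-protocol counter that tcp_mem limits. */
struct proto_model {
	long memory_allocated;
};

/*
 * Charge 'pages' for a socket:
 *  - memcg present: always charge the memcg;
 *  - no memcg, or memcg not isolated: also charge the global counter.
 */
void charge_model(struct proto_model *prot, struct sk_model *sk, long pages)
{
	assert(pages >= 0);

	if (sk->memcg)
		sk->memcg->charged_pages += pages;

	if (!sk->memcg || !sk->memcg->isolated)
		prot->memory_allocated += pages;
}
```

With this model, a cgroup that has no memory.max and leaves the knob at 0
still counts against the global limit, which is the behaviour wanted for
the systemd case.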