* [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated.
@ 2025-07-21 20:35 Kuniyuki Iwashima
2025-07-21 20:35 ` [PATCH v1 net-next 01/13] mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n Kuniyuki Iwashima
` (13 more replies)
0 siblings, 14 replies; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-21 20:35 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton
Cc: Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
Some protocols (e.g., TCP, UDP) have their own memory accounting for
socket buffers and charge memory to global per-protocol counters such
as /proc/sys/net/ipv4/tcp_mem.
When running under a non-root cgroup, this memory is also charged to
the memcg as sock in memory.stat.
Sockets using such protocols are still subject to the global limits,
and thus affected by a noisy neighbour outside the cgroup.
This makes it difficult to accurately estimate and configure appropriate
global limits.
If all workloads were guaranteed to be controlled under memcg, the issue
could be worked around by setting tcp_mem[0~2] to UINT_MAX.
However, this assumption does not always hold, and a single workload that
opts out of memcg can consume memory up to the global limit, which is
problematic.
This series introduces a new per-memcg knob to allow decoupling memcg
from the global memory accounting, which simplifies the memcg
configuration while keeping the global limits within a reasonable range.
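For example, a cgroup can be opted out of the global protocol limits
like so (memory.socket_isolated is the knob added in patch 11; the
cgroup path follows the test setup shown in patch 13):

  # mkdir /sys/fs/cgroup/test
  # echo $$ >> /sys/fs/cgroup/test/cgroup.procs
  # echo 1 > /sys/fs/cgroup/test/memory.socket_isolated

Sockets created in the cgroup after this point are accounted to the
memcg only.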
Overview of the series:
patch 1 is a bug fix for MPTCP
patches 2 ~ 9 move sk->sk_memcg accesses to a single place
patch 10 moves sk_memcg under CONFIG_MEMCG
patches 11 & 12 introduce a flag and store it in the lowest bit of sk->sk_memcg
patch 13 decouples memcg from sk_prot->memory_allocated based on the flag
Kuniyuki Iwashima (13):
mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n.
mptcp: Use tcp_under_memory_pressure() in mptcp_epollin_ready().
tcp: Simplify error path in inet_csk_accept().
net: Call trace_sock_exceed_buf_limit() for memcg failure with
SK_MEM_RECV.
net: Clean up __sk_mem_raise_allocated().
net-memcg: Introduce mem_cgroup_from_sk().
net-memcg: Introduce mem_cgroup_sk_enabled().
net-memcg: Pass struct sock to mem_cgroup_sk_(un)?charge().
net-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure().
net: Define sk_memcg under CONFIG_MEMCG.
net-memcg: Add memory.socket_isolated knob.
net-memcg: Store memcg->socket_isolated in sk->sk_memcg.
net-memcg: Allow decoupling memcg from global protocol memory
accounting.
Documentation/admin-guide/cgroup-v2.rst | 16 +++++
include/linux/memcontrol.h | 50 ++++++++-----
include/net/proto_memory.h | 10 ++-
include/net/sock.h | 66 +++++++++++++++++
include/net/tcp.h | 10 ++-
mm/memcontrol.c | 84 +++++++++++++++++++---
net/core/sock.c | 95 ++++++++++++++++---------
net/ipv4/inet_connection_sock.c | 35 +++++----
net/ipv4/tcp_output.c | 13 ++--
net/mptcp/protocol.h | 4 +-
net/mptcp/subflow.c | 11 +--
11 files changed, 299 insertions(+), 95 deletions(-)
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH v1 net-next 01/13] mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n.
2025-07-21 20:35 [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
@ 2025-07-21 20:35 ` Kuniyuki Iwashima
2025-07-22 14:30 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 02/13] mptcp: Use tcp_under_memory_pressure() in mptcp_epollin_ready() Kuniyuki Iwashima
` (12 subsequent siblings)
13 siblings, 1 reply; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-21 20:35 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton
Cc: Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
When sk_alloc() allocates a socket, mem_cgroup_sk_alloc() sets
sk->sk_memcg based on the current task.
MPTCP subflow socket creation is triggered from userspace or
an in-kernel worker.
In the latter case, sk->sk_memcg is not what we want. So, we fix
it up from the parent socket's sk->sk_memcg in mptcp_attach_cgroup().
Although the code is placed under #ifdef CONFIG_MEMCG, it is buried
under #ifdef CONFIG_SOCK_CGROUP_DATA.
The two configs are orthogonal. If CONFIG_MEMCG is enabled without
CONFIG_SOCK_CGROUP_DATA, the subflow's memory usage is not charged
correctly.
Let's move the code out of the wrong ifdef guard.
Note that sk->sk_memcg is freed in sk_prot_free(), and the parent
sk holds a reference on memcg->css here, so we don't need to use
css_tryget().
Fixes: 3764b0c5651e3 ("mptcp: attach subflow socket to parent cgroup")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
include/linux/memcontrol.h | 7 +++++++
mm/memcontrol.c | 10 ++++++++++
net/mptcp/subflow.c | 11 +++--------
3 files changed, 20 insertions(+), 8 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 87b6688f124a7..d8319ad5e8ea7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1602,6 +1602,8 @@ extern struct static_key_false memcg_sockets_enabled_key;
#define mem_cgroup_sockets_enabled static_branch_unlikely(&memcg_sockets_enabled_key)
void mem_cgroup_sk_alloc(struct sock *sk);
void mem_cgroup_sk_free(struct sock *sk);
+void mem_cgroup_sk_inherit(const struct sock *sk, struct sock *newsk);
+
static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
{
#ifdef CONFIG_MEMCG_V1
@@ -1623,6 +1625,11 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg);
#define mem_cgroup_sockets_enabled 0
static inline void mem_cgroup_sk_alloc(struct sock *sk) { };
static inline void mem_cgroup_sk_free(struct sock *sk) { };
+
+static inline void mem_cgroup_sk_inherit(const struct sock *sk, struct sock *newsk)
+{
+}
+
static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
{
return false;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 70fdeda1120b3..54eb25d8d555c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5090,6 +5090,16 @@ void mem_cgroup_sk_free(struct sock *sk)
css_put(&sk->sk_memcg->css);
}
+void mem_cgroup_sk_inherit(const struct sock *sk, struct sock *newsk)
+{
+ if (sk->sk_memcg == newsk->sk_memcg)
+ return;
+
+ mem_cgroup_sk_free(newsk);
+ css_get(&sk->sk_memcg->css);
+ newsk->sk_memcg = sk->sk_memcg;
+}
+
/**
* mem_cgroup_charge_skmem - charge socket memory
* @memcg: memcg to charge
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 1802bc5435a1a..f21d90fb1a19d 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1716,19 +1716,14 @@ static void mptcp_attach_cgroup(struct sock *parent, struct sock *child)
/* only the additional subflows created by kworkers have to be modified */
if (cgroup_id(sock_cgroup_ptr(parent_skcd)) !=
cgroup_id(sock_cgroup_ptr(child_skcd))) {
-#ifdef CONFIG_MEMCG
- struct mem_cgroup *memcg = parent->sk_memcg;
-
- mem_cgroup_sk_free(child);
- if (memcg && css_tryget(&memcg->css))
- child->sk_memcg = memcg;
-#endif /* CONFIG_MEMCG */
-
cgroup_sk_free(child_skcd);
*child_skcd = *parent_skcd;
cgroup_sk_clone(child_skcd);
}
#endif /* CONFIG_SOCK_CGROUP_DATA */
+
+ if (mem_cgroup_sockets_enabled && parent->sk_memcg)
+ mem_cgroup_sk_inherit(parent, child);
}
static void mptcp_subflow_ops_override(struct sock *ssk)
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH v1 net-next 02/13] mptcp: Use tcp_under_memory_pressure() in mptcp_epollin_ready().
2025-07-21 20:35 [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
2025-07-21 20:35 ` [PATCH v1 net-next 01/13] mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n Kuniyuki Iwashima
@ 2025-07-21 20:35 ` Kuniyuki Iwashima
2025-07-22 14:33 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 03/13] tcp: Simplify error path in inet_csk_accept() Kuniyuki Iwashima
` (11 subsequent siblings)
13 siblings, 1 reply; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-21 20:35 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton
Cc: Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
Some of the conditions used in mptcp_epollin_ready() are the same
as those in tcp_under_memory_pressure().
We will modify tcp_under_memory_pressure() in a later patch.
Let's use tcp_under_memory_pressure() here instead.
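For reference, the helper currently reads as follows (abridged from
include/net/tcp.h; the pre-image is visible in patch 7's diff):

  static inline bool tcp_under_memory_pressure(const struct sock *sk)
  {
          if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
              mem_cgroup_under_socket_pressure(sk->sk_memcg))
                  return true;

          return READ_ONCE(tcp_memory_pressure);
  }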
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
net/mptcp/protocol.h | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 6ec245fd2778e..752e8277f2616 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -787,9 +787,7 @@ static inline bool mptcp_epollin_ready(const struct sock *sk)
* as it can always coalesce them
*/
return (data_avail >= sk->sk_rcvlowat) ||
- (mem_cgroup_sockets_enabled && sk->sk_memcg &&
- mem_cgroup_under_socket_pressure(sk->sk_memcg)) ||
- READ_ONCE(tcp_memory_pressure);
+ tcp_under_memory_pressure(sk);
}
int mptcp_set_rcvlowat(struct sock *sk, int val);
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH v1 net-next 03/13] tcp: Simplify error path in inet_csk_accept().
2025-07-21 20:35 [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
2025-07-21 20:35 ` [PATCH v1 net-next 01/13] mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n Kuniyuki Iwashima
2025-07-21 20:35 ` [PATCH v1 net-next 02/13] mptcp: Use tcp_under_memory_pressure() in mptcp_epollin_ready() Kuniyuki Iwashima
@ 2025-07-21 20:35 ` Kuniyuki Iwashima
2025-07-22 14:34 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 04/13] net: Call trace_sock_exceed_buf_limit() for memcg failure with SK_MEM_RECV Kuniyuki Iwashima
` (10 subsequent siblings)
13 siblings, 1 reply; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-21 20:35 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton
Cc: Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
When an error occurs in inet_csk_accept(), all we need to do is
call release_sock() and set the error to arg->err.
However, the error path jumps to another label, which introduces
unnecessary initialisation of newsk and req and NULL tests for newsk.
Let's simplify the error path and remove the redundant NULL
checks for newsk.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
net/ipv4/inet_connection_sock.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 1e2df51427fed..724bd9ed6cd48 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -706,9 +706,9 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
spin_unlock_bh(&queue->fastopenq.lock);
}
-out:
release_sock(sk);
- if (newsk && mem_cgroup_sockets_enabled) {
+
+ if (mem_cgroup_sockets_enabled) {
gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
int amt = 0;
@@ -732,18 +732,17 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
release_sock(newsk);
}
+
if (req)
reqsk_put(req);
- if (newsk)
- inet_init_csk_locks(newsk);
-
+ inet_init_csk_locks(newsk);
return newsk;
+
out_err:
- newsk = NULL;
- req = NULL;
+ release_sock(sk);
arg->err = error;
- goto out;
+ return NULL;
}
EXPORT_SYMBOL(inet_csk_accept);
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH v1 net-next 04/13] net: Call trace_sock_exceed_buf_limit() for memcg failure with SK_MEM_RECV.
2025-07-21 20:35 [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
` (2 preceding siblings ...)
2025-07-21 20:35 ` [PATCH v1 net-next 03/13] tcp: Simplify error path in inet_csk_accept() Kuniyuki Iwashima
@ 2025-07-21 20:35 ` Kuniyuki Iwashima
2025-07-22 14:37 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 05/13] net: Clean up __sk_mem_raise_allocated() Kuniyuki Iwashima
` (9 subsequent siblings)
13 siblings, 1 reply; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-21 20:35 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton
Cc: Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
Initially, trace_sock_exceed_buf_limit() was invoked when
__sk_mem_raise_allocated() failed due to the memcg limit or the
global limit.
However, commit d6f19938eb031 ("net: expose sk wmem in
sock_exceed_buf_limit tracepoint") somehow suppressed the event
only when memcg fails to charge for SK_MEM_RECV, although a memcg
charge failure for SK_MEM_SEND still triggers the event.
Let's restore the event for SK_MEM_RECV.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
net/core/sock.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/net/core/sock.c b/net/core/sock.c
index 7c26ec8dce630..380bc1aa69829 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3354,8 +3354,7 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
}
}
- if (kind == SK_MEM_SEND || (kind == SK_MEM_RECV && charged))
- trace_sock_exceed_buf_limit(sk, prot, allocated, kind);
+ trace_sock_exceed_buf_limit(sk, prot, allocated, kind);
sk_memory_allocated_sub(sk, amt);
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH v1 net-next 05/13] net: Clean up __sk_mem_raise_allocated().
2025-07-21 20:35 [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
` (3 preceding siblings ...)
2025-07-21 20:35 ` [PATCH v1 net-next 04/13] net: Call trace_sock_exceed_buf_limit() for memcg failure with SK_MEM_RECV Kuniyuki Iwashima
@ 2025-07-21 20:35 ` Kuniyuki Iwashima
2025-07-22 14:38 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 06/13] net-memcg: Introduce mem_cgroup_from_sk() Kuniyuki Iwashima
` (8 subsequent siblings)
13 siblings, 1 reply; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-21 20:35 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton
Cc: Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
In __sk_mem_raise_allocated(), charged is initialised to true due
to the weird condition removed in the previous patch.
This makes the variable unreliable on its own, so we also have to
check another variable, memcg, in advance.
Also, we will later factorise the common memcg check below:
  if (mem_cgroup_sockets_enabled && sk->sk_memcg)
As a preparation, let's initialise charged to false and memcg to NULL.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
net/core/sock.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/net/core/sock.c b/net/core/sock.c
index 380bc1aa69829..000940ecf360e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3263,15 +3263,16 @@ EXPORT_SYMBOL(sk_wait_data);
*/
int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
{
- struct mem_cgroup *memcg = mem_cgroup_sockets_enabled ? sk->sk_memcg : NULL;
struct proto *prot = sk->sk_prot;
- bool charged = true;
+ struct mem_cgroup *memcg = NULL;
+ bool charged = false;
long allocated;
sk_memory_allocated_add(sk, amt);
allocated = sk_memory_allocated(sk);
- if (memcg) {
+ if (mem_cgroup_sockets_enabled && sk->sk_memcg) {
+ memcg = sk->sk_memcg;
charged = mem_cgroup_charge_skmem(memcg, amt, gfp_memcg_charge());
if (!charged)
goto suppress_allocation;
@@ -3358,7 +3359,7 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
sk_memory_allocated_sub(sk, amt);
- if (memcg && charged)
+ if (charged)
mem_cgroup_uncharge_skmem(memcg, amt);
return 0;
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH v1 net-next 06/13] net-memcg: Introduce mem_cgroup_from_sk().
2025-07-21 20:35 [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
` (4 preceding siblings ...)
2025-07-21 20:35 ` [PATCH v1 net-next 05/13] net: Clean up __sk_mem_raise_allocated() Kuniyuki Iwashima
@ 2025-07-21 20:35 ` Kuniyuki Iwashima
2025-07-22 14:39 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 07/13] net-memcg: Introduce mem_cgroup_sk_enabled() Kuniyuki Iwashima
` (7 subsequent siblings)
13 siblings, 1 reply; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-21 20:35 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton
Cc: Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
We will store a flag in the lowest bit of sk->sk_memcg.
Then, directly dereferencing sk->sk_memcg will be illegal, and we
do not want to allow touching the raw sk->sk_memcg in many places.
Let's introduce mem_cgroup_from_sk().
Other places accessing the raw sk->sk_memcg will be converted later.
Note that we cannot define the helper as an inline function in
memcontrol.h because we cannot access any fields of struct sock
there due to a circular dependency, so it is placed in sock.h.
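At this point the helper is a transparent accessor. Once the flag is
stored in the pointer (patch 12), it will mask off the low bit, along
the lines of:

  static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
  {
          unsigned long val = (unsigned long)sk->sk_memcg;

          val &= MEMCG_SOCK_PTR_MASK;

          return (struct mem_cgroup *)val;
  }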
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
include/net/sock.h | 12 ++++++++++++
mm/memcontrol.c | 14 +++++++++-----
net/ipv4/inet_connection_sock.c | 2 +-
3 files changed, 22 insertions(+), 6 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index c8a4b283df6fc..811f95ea8d00c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2594,6 +2594,18 @@ static inline gfp_t gfp_memcg_charge(void)
return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
}
+#ifdef CONFIG_MEMCG
+static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
+{
+ return sk->sk_memcg;
+}
+#else
+static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
+{
+ return NULL;
+}
+#endif
+
static inline long sock_rcvtimeo(const struct sock *sk, bool noblock)
{
return noblock ? 0 : READ_ONCE(sk->sk_rcvtimeo);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 54eb25d8d555c..89b33e635cf89 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5086,18 +5086,22 @@ void mem_cgroup_sk_alloc(struct sock *sk)
void mem_cgroup_sk_free(struct sock *sk)
{
- if (sk->sk_memcg)
- css_put(&sk->sk_memcg->css);
+ struct mem_cgroup *memcg = mem_cgroup_from_sk(sk);
+
+ if (memcg)
+ css_put(&memcg->css);
}
void mem_cgroup_sk_inherit(const struct sock *sk, struct sock *newsk)
{
- if (sk->sk_memcg == newsk->sk_memcg)
+ struct mem_cgroup *memcg = mem_cgroup_from_sk(sk);
+
+ if (memcg == mem_cgroup_from_sk(newsk))
return;
mem_cgroup_sk_free(newsk);
- css_get(&sk->sk_memcg->css);
- newsk->sk_memcg = sk->sk_memcg;
+ css_get(&memcg->css);
+ newsk->sk_memcg = memcg;
}
/**
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 724bd9ed6cd48..93569bbe00f44 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -718,7 +718,7 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
lock_sock(newsk);
mem_cgroup_sk_alloc(newsk);
- if (newsk->sk_memcg) {
+ if (mem_cgroup_from_sk(newsk)) {
/* The socket has not been accepted yet, no need
* to look at newsk->sk_wmem_queued.
*/
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH v1 net-next 07/13] net-memcg: Introduce mem_cgroup_sk_enabled().
2025-07-21 20:35 [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
` (5 preceding siblings ...)
2025-07-21 20:35 ` [PATCH v1 net-next 06/13] net-memcg: Introduce mem_cgroup_from_sk() Kuniyuki Iwashima
@ 2025-07-21 20:35 ` Kuniyuki Iwashima
2025-07-22 14:40 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 08/13] net-memcg: Pass struct sock to mem_cgroup_sk_(un)?charge() Kuniyuki Iwashima
` (6 subsequent siblings)
13 siblings, 1 reply; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-21 20:35 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton
Cc: Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
The socket memcg feature is enabled by a static key and
only works for non-root cgroups.
We check both conditions in many places.
Let's factorise them into a helper function.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
include/net/proto_memory.h | 2 +-
include/net/sock.h | 10 ++++++++++
include/net/tcp.h | 2 +-
net/core/sock.c | 6 +++---
net/ipv4/tcp_output.c | 2 +-
net/mptcp/subflow.c | 2 +-
6 files changed, 17 insertions(+), 7 deletions(-)
diff --git a/include/net/proto_memory.h b/include/net/proto_memory.h
index a6ab2f4f5e28a..859e63de81c49 100644
--- a/include/net/proto_memory.h
+++ b/include/net/proto_memory.h
@@ -31,7 +31,7 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
if (!sk->sk_prot->memory_pressure)
return false;
- if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
+ if (mem_cgroup_sk_enabled(sk) &&
mem_cgroup_under_socket_pressure(sk->sk_memcg))
return true;
diff --git a/include/net/sock.h b/include/net/sock.h
index 811f95ea8d00c..3efdf680401dd 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2599,11 +2599,21 @@ static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
{
return sk->sk_memcg;
}
+
+static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
+{
+ return mem_cgroup_sockets_enabled && mem_cgroup_from_sk(sk);
+}
#else
static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
{
return NULL;
}
+
+static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
+{
+ return false;
+}
#endif
static inline long sock_rcvtimeo(const struct sock *sk, bool noblock)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index b3815d1043400..f9a0eb242e65c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -275,7 +275,7 @@ extern unsigned long tcp_memory_pressure;
/* optimized version of sk_under_memory_pressure() for TCP sockets */
static inline bool tcp_under_memory_pressure(const struct sock *sk)
{
- if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
+ if (mem_cgroup_sk_enabled(sk) &&
mem_cgroup_under_socket_pressure(sk->sk_memcg))
return true;
diff --git a/net/core/sock.c b/net/core/sock.c
index 000940ecf360e..ab658fe23e1e6 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1032,7 +1032,7 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
bool charged;
int pages;
- if (!mem_cgroup_sockets_enabled || !sk->sk_memcg || !sk_has_account(sk))
+ if (!mem_cgroup_sk_enabled(sk) || !sk_has_account(sk))
return -EOPNOTSUPP;
if (!bytes)
@@ -3271,7 +3271,7 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
sk_memory_allocated_add(sk, amt);
allocated = sk_memory_allocated(sk);
- if (mem_cgroup_sockets_enabled && sk->sk_memcg) {
+ if (mem_cgroup_sk_enabled(sk)) {
memcg = sk->sk_memcg;
charged = mem_cgroup_charge_skmem(memcg, amt, gfp_memcg_charge());
if (!charged)
@@ -3398,7 +3398,7 @@ void __sk_mem_reduce_allocated(struct sock *sk, int amount)
{
sk_memory_allocated_sub(sk, amount);
- if (mem_cgroup_sockets_enabled && sk->sk_memcg)
+ if (mem_cgroup_sk_enabled(sk))
mem_cgroup_uncharge_skmem(sk->sk_memcg, amount);
if (sk_under_global_memory_pressure(sk) &&
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b616776e3354c..4e0af5c824c1a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3566,7 +3566,7 @@ void sk_forced_mem_schedule(struct sock *sk, int size)
sk_forward_alloc_add(sk, amt << PAGE_SHIFT);
sk_memory_allocated_add(sk, amt);
- if (mem_cgroup_sockets_enabled && sk->sk_memcg)
+ if (mem_cgroup_sk_enabled(sk))
mem_cgroup_charge_skmem(sk->sk_memcg, amt,
gfp_memcg_charge() | __GFP_NOFAIL);
}
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index f21d90fb1a19d..5325642bcbbce 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1722,7 +1722,7 @@ static void mptcp_attach_cgroup(struct sock *parent, struct sock *child)
}
#endif /* CONFIG_SOCK_CGROUP_DATA */
- if (mem_cgroup_sockets_enabled && parent->sk_memcg)
+ if (mem_cgroup_sk_enabled(parent))
mem_cgroup_sk_inherit(parent, child);
}
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH v1 net-next 08/13] net-memcg: Pass struct sock to mem_cgroup_sk_(un)?charge().
2025-07-21 20:35 [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
` (6 preceding siblings ...)
2025-07-21 20:35 ` [PATCH v1 net-next 07/13] net-memcg: Introduce mem_cgroup_sk_enabled() Kuniyuki Iwashima
@ 2025-07-21 20:35 ` Kuniyuki Iwashima
2025-07-22 14:56 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 09/13] net-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure() Kuniyuki Iwashima
` (5 subsequent siblings)
13 siblings, 1 reply; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-21 20:35 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton
Cc: Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
We will store a flag in the lowest bit of sk->sk_memcg.
Then, we cannot pass the raw pointer to mem_cgroup_charge_skmem()
and mem_cgroup_uncharge_skmem().
Let's pass struct sock to these functions.
While at it, rename them to match the other functions prefixed
with mem_cgroup_sk_.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
include/linux/memcontrol.h | 29 ++++++++++++++++++++++++-----
mm/memcontrol.c | 18 +++++++++++-------
net/core/sock.c | 24 +++++++++++-------------
net/ipv4/inet_connection_sock.c | 2 +-
net/ipv4/tcp_output.c | 3 +--
5 files changed, 48 insertions(+), 28 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d8319ad5e8ea7..9ccbcddbe3b8e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1594,15 +1594,16 @@ static inline void mem_cgroup_flush_foreign(struct bdi_writeback *wb)
#endif /* CONFIG_CGROUP_WRITEBACK */
struct sock;
-bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
- gfp_t gfp_mask);
-void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
#ifdef CONFIG_MEMCG
extern struct static_key_false memcg_sockets_enabled_key;
#define mem_cgroup_sockets_enabled static_branch_unlikely(&memcg_sockets_enabled_key)
+
void mem_cgroup_sk_alloc(struct sock *sk);
void mem_cgroup_sk_free(struct sock *sk);
void mem_cgroup_sk_inherit(const struct sock *sk, struct sock *newsk);
+bool mem_cgroup_sk_charge(const struct sock *sk, unsigned int nr_pages,
+ gfp_t gfp_mask);
+void mem_cgroup_sk_uncharge(const struct sock *sk, unsigned int nr_pages);
static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
{
@@ -1623,13 +1624,31 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id);
void reparent_shrinker_deferred(struct mem_cgroup *memcg);
#else
#define mem_cgroup_sockets_enabled 0
-static inline void mem_cgroup_sk_alloc(struct sock *sk) { };
-static inline void mem_cgroup_sk_free(struct sock *sk) { };
+
+static inline void mem_cgroup_sk_alloc(struct sock *sk)
+{
+}
+
+static inline void mem_cgroup_sk_free(struct sock *sk)
+{
+}
static inline void mem_cgroup_sk_inherit(const struct sock *sk, struct sock *newsk)
{
}
+static inline bool mem_cgroup_sk_charge(const struct sock *sk,
+ unsigned int nr_pages,
+ gfp_t gfp_mask)
+{
+ return false;
+}
+
+static inline void mem_cgroup_sk_uncharge(const struct sock *sk,
+ unsigned int nr_pages)
+{
+}
+
static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
{
return false;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 89b33e635cf89..d7f4e31f4e625 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5105,17 +5105,19 @@ void mem_cgroup_sk_inherit(const struct sock *sk, struct sock *newsk)
}
/**
- * mem_cgroup_charge_skmem - charge socket memory
- * @memcg: memcg to charge
+ * mem_cgroup_sk_charge - charge socket memory
+ * @sk: socket in memcg to charge
* @nr_pages: number of pages to charge
* @gfp_mask: reclaim mode
*
* Charges @nr_pages to @memcg. Returns %true if the charge fit within
* @memcg's configured limit, %false if it doesn't.
*/
-bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
- gfp_t gfp_mask)
+bool mem_cgroup_sk_charge(const struct sock *sk, unsigned int nr_pages,
+ gfp_t gfp_mask)
{
+ struct mem_cgroup *memcg = mem_cgroup_from_sk(sk);
+
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return memcg1_charge_skmem(memcg, nr_pages, gfp_mask);
@@ -5128,12 +5130,14 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
}
/**
- * mem_cgroup_uncharge_skmem - uncharge socket memory
- * @memcg: memcg to uncharge
+ * mem_cgroup_sk_uncharge - uncharge socket memory
+ * @sk: socket in memcg to uncharge
* @nr_pages: number of pages to uncharge
*/
-void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
+void mem_cgroup_sk_uncharge(const struct sock *sk, unsigned int nr_pages)
{
+ struct mem_cgroup *memcg = mem_cgroup_from_sk(sk);
+
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
memcg1_uncharge_skmem(memcg, nr_pages);
return;
diff --git a/net/core/sock.c b/net/core/sock.c
index ab658fe23e1e6..5537ca2638588 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1041,8 +1041,8 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
pages = sk_mem_pages(bytes);
/* pre-charge to memcg */
- charged = mem_cgroup_charge_skmem(sk->sk_memcg, pages,
- GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+ charged = mem_cgroup_sk_charge(sk, pages,
+ GFP_KERNEL | __GFP_RETRY_MAYFAIL);
if (!charged)
return -ENOMEM;
@@ -1054,7 +1054,7 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
*/
if (allocated > sk_prot_mem_limits(sk, 1)) {
sk_memory_allocated_sub(sk, pages);
- mem_cgroup_uncharge_skmem(sk->sk_memcg, pages);
+ mem_cgroup_sk_uncharge(sk, pages);
return -ENOMEM;
}
sk_forward_alloc_add(sk, pages << PAGE_SHIFT);
@@ -3263,17 +3263,16 @@ EXPORT_SYMBOL(sk_wait_data);
*/
int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
{
+ bool memcg_enabled = false, charged = false;
struct proto *prot = sk->sk_prot;
- struct mem_cgroup *memcg = NULL;
- bool charged = false;
long allocated;
sk_memory_allocated_add(sk, amt);
allocated = sk_memory_allocated(sk);
if (mem_cgroup_sk_enabled(sk)) {
- memcg = sk->sk_memcg;
- charged = mem_cgroup_charge_skmem(memcg, amt, gfp_memcg_charge());
+ memcg_enabled = true;
+ charged = mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge());
if (!charged)
goto suppress_allocation;
}
@@ -3347,10 +3346,9 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
*/
if (sk->sk_wmem_queued + size >= sk->sk_sndbuf) {
/* Force charge with __GFP_NOFAIL */
- if (memcg && !charged) {
- mem_cgroup_charge_skmem(memcg, amt,
- gfp_memcg_charge() | __GFP_NOFAIL);
- }
+ if (memcg_enabled && !charged)
+ mem_cgroup_sk_charge(sk, amt,
+ gfp_memcg_charge() | __GFP_NOFAIL);
return 1;
}
}
@@ -3360,7 +3358,7 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
sk_memory_allocated_sub(sk, amt);
if (charged)
- mem_cgroup_uncharge_skmem(memcg, amt);
+ mem_cgroup_sk_uncharge(sk, amt);
return 0;
}
@@ -3399,7 +3397,7 @@ void __sk_mem_reduce_allocated(struct sock *sk, int amount)
sk_memory_allocated_sub(sk, amount);
if (mem_cgroup_sk_enabled(sk))
- mem_cgroup_uncharge_skmem(sk->sk_memcg, amount);
+ mem_cgroup_sk_uncharge(sk, amount);
if (sk_under_global_memory_pressure(sk) &&
(sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 93569bbe00f44..0ef1eacd539d1 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -727,7 +727,7 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
}
if (amt)
- mem_cgroup_charge_skmem(newsk->sk_memcg, amt, gfp);
+ mem_cgroup_sk_charge(newsk, amt, gfp);
kmem_cache_charge(newsk, gfp);
release_sock(newsk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4e0af5c824c1a..09f0802f36afa 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3567,8 +3567,7 @@ void sk_forced_mem_schedule(struct sock *sk, int size)
sk_memory_allocated_add(sk, amt);
if (mem_cgroup_sk_enabled(sk))
- mem_cgroup_charge_skmem(sk->sk_memcg, amt,
- gfp_memcg_charge() | __GFP_NOFAIL);
+ mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge() | __GFP_NOFAIL);
}
/* Send a FIN. The caller locks the socket for us.
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH v1 net-next 09/13] net-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure().
2025-07-21 20:35 [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
` (7 preceding siblings ...)
2025-07-21 20:35 ` [PATCH v1 net-next 08/13] net-memcg: Pass struct sock to mem_cgroup_sk_(un)?charge() Kuniyuki Iwashima
@ 2025-07-21 20:35 ` Kuniyuki Iwashima
2025-07-22 14:58 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 10/13] net: Define sk_memcg under CONFIG_MEMCG Kuniyuki Iwashima
` (4 subsequent siblings)
13 siblings, 1 reply; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-21 20:35 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton
Cc: Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
We will store a flag in the lowest bit of sk->sk_memcg.
Then, we cannot pass the raw pointer to
mem_cgroup_under_socket_pressure().
Let's pass struct sock to it and rename the function to match the
other functions prefixed with mem_cgroup_sk_.
Note that the helper is moved to sock.h so that it can use
mem_cgroup_from_sk().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
include/linux/memcontrol.h | 18 ------------------
include/net/proto_memory.h | 2 +-
include/net/sock.h | 21 +++++++++++++++++++++
include/net/tcp.h | 2 +-
4 files changed, 23 insertions(+), 20 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9ccbcddbe3b8e..211712ec57d1a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1605,19 +1605,6 @@ bool mem_cgroup_sk_charge(const struct sock *sk, unsigned int nr_pages,
gfp_t gfp_mask);
void mem_cgroup_sk_uncharge(const struct sock *sk, unsigned int nr_pages);
-static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
-{
-#ifdef CONFIG_MEMCG_V1
- if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
- return !!memcg->tcpmem_pressure;
-#endif /* CONFIG_MEMCG_V1 */
- do {
- if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
- return true;
- } while ((memcg = parent_mem_cgroup(memcg)));
- return false;
-}
-
int alloc_shrinker_info(struct mem_cgroup *memcg);
void free_shrinker_info(struct mem_cgroup *memcg);
void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id);
@@ -1649,11 +1636,6 @@ static inline void mem_cgroup_sk_uncharge(const struct sock *sk,
{
}
-static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
-{
- return false;
-}
-
static inline void set_shrinker_bit(struct mem_cgroup *memcg,
int nid, int shrinker_id)
{
diff --git a/include/net/proto_memory.h b/include/net/proto_memory.h
index 859e63de81c49..8e91a8fa31b52 100644
--- a/include/net/proto_memory.h
+++ b/include/net/proto_memory.h
@@ -32,7 +32,7 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
return false;
if (mem_cgroup_sk_enabled(sk) &&
- mem_cgroup_under_socket_pressure(sk->sk_memcg))
+ mem_cgroup_sk_under_memory_pressure(sk))
return true;
return !!READ_ONCE(*sk->sk_prot->memory_pressure);
diff --git a/include/net/sock.h b/include/net/sock.h
index 3efdf680401dd..efb2f659236d4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2604,6 +2604,22 @@ static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
{
return mem_cgroup_sockets_enabled && mem_cgroup_from_sk(sk);
}
+
+static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_sk(sk);
+
+#ifdef CONFIG_MEMCG_V1
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return !!memcg->tcpmem_pressure;
+#endif /* CONFIG_MEMCG_V1 */
+ do {
+ if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
+ return true;
+ } while ((memcg = parent_mem_cgroup(memcg)));
+
+ return false;
+}
#else
static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
{
@@ -2614,6 +2630,11 @@ static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
{
return false;
}
+
+static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
+{
+ return false;
+}
#endif
static inline long sock_rcvtimeo(const struct sock *sk, bool noblock)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f9a0eb242e65c..9ffe971a1856b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -276,7 +276,7 @@ extern unsigned long tcp_memory_pressure;
static inline bool tcp_under_memory_pressure(const struct sock *sk)
{
if (mem_cgroup_sk_enabled(sk) &&
- mem_cgroup_under_socket_pressure(sk->sk_memcg))
+ mem_cgroup_sk_under_memory_pressure(sk))
return true;
return READ_ONCE(tcp_memory_pressure);
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH v1 net-next 10/13] net: Define sk_memcg under CONFIG_MEMCG.
2025-07-21 20:35 [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
` (8 preceding siblings ...)
2025-07-21 20:35 ` [PATCH v1 net-next 09/13] net-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure() Kuniyuki Iwashima
@ 2025-07-21 20:35 ` Kuniyuki Iwashima
2025-07-22 14:58 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 11/13] net-memcg: Add memory.socket_isolated knob Kuniyuki Iwashima
` (3 subsequent siblings)
13 siblings, 1 reply; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-21 20:35 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton
Cc: Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
Except for sk_clone_lock(), all accesses to sk->sk_memcg are
done under CONFIG_MEMCG.
Let's define sk->sk_memcg under CONFIG_MEMCG as well, so that the
field does not take up space in struct sock when memcg is disabled.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
include/net/sock.h | 2 ++
net/core/sock.c | 4 ++++
2 files changed, 6 insertions(+)
diff --git a/include/net/sock.h b/include/net/sock.h
index efb2f659236d4..16fe0e5afc587 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -443,7 +443,9 @@ struct sock {
__cacheline_group_begin(sock_read_rxtx);
int sk_err;
struct socket *sk_socket;
+#ifdef CONFIG_MEMCG
struct mem_cgroup *sk_memcg;
+#endif
#ifdef CONFIG_XFRM
struct xfrm_policy __rcu *sk_policy[2];
#endif
diff --git a/net/core/sock.c b/net/core/sock.c
index 5537ca2638588..ab6953d295dfa 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2512,8 +2512,10 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
sock_reset_flag(newsk, SOCK_DONE);
+#ifdef CONFIG_MEMCG
/* sk->sk_memcg will be populated at accept() time */
newsk->sk_memcg = NULL;
+#endif
cgroup_sk_clone(&newsk->sk_cgrp_data);
@@ -4452,7 +4454,9 @@ static int __init sock_struct_check(void)
CACHELINE_ASSERT_GROUP_MEMBER(struct sock, sock_read_rxtx, sk_err);
CACHELINE_ASSERT_GROUP_MEMBER(struct sock, sock_read_rxtx, sk_socket);
+#ifdef CONFIG_MEMCG
CACHELINE_ASSERT_GROUP_MEMBER(struct sock, sock_read_rxtx, sk_memcg);
+#endif
CACHELINE_ASSERT_GROUP_MEMBER(struct sock, sock_write_rxtx, sk_lock);
CACHELINE_ASSERT_GROUP_MEMBER(struct sock, sock_write_rxtx, sk_reserved_mem);
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH v1 net-next 11/13] net-memcg: Add memory.socket_isolated knob.
2025-07-21 20:35 [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
` (9 preceding siblings ...)
2025-07-21 20:35 ` [PATCH v1 net-next 10/13] net: Define sk_memcg under CONFIG_MEMCG Kuniyuki Iwashima
@ 2025-07-21 20:35 ` Kuniyuki Iwashima
2025-07-22 15:00 ` Eric Dumazet
2025-07-31 13:39 ` Michal Koutný
2025-07-21 20:35 ` [PATCH v1 net-next 12/13] net-memcg: Store memcg->socket_isolated in sk->sk_memcg Kuniyuki Iwashima
` (2 subsequent siblings)
13 siblings, 2 replies; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-21 20:35 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton
Cc: Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
Some networking protocols have their own global memory accounting,
and such memory is also charged to memcg as sock in memory.stat.
Such sockets are subject to the global limit, and thus affected by a
noisy neighbour outside the cgroup.
We will decouple such sockets from the global memory accounting when
configured to do so.
Let's add a per-memcg knob to control that.
The value will be saved in each socket when created and will
persist through the socket's lifetime.
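Expected usage; only "0" and "1" are accepted, and any other value is
rejected with -EINVAL (the session below is illustrative):

  # echo 1 > /sys/fs/cgroup/test/memory.socket_isolated
  # cat /sys/fs/cgroup/test/memory.socket_isolated
  1
  # echo 2 > /sys/fs/cgroup/test/memory.socket_isolated
  -bash: echo: write error: Invalid argument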
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
Documentation/admin-guide/cgroup-v2.rst | 16 +++++++++++
include/linux/memcontrol.h | 6 ++++
include/net/sock.h | 3 ++
mm/memcontrol.c | 37 +++++++++++++++++++++++++
4 files changed, 62 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index bd98ea3175ec1..2428707b7d27d 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1878,6 +1878,22 @@ The following nested keys are defined.
Shows pressure stall information for memory. See
:ref:`Documentation/accounting/psi.rst <psi>` for details.
+ memory.socket_isolated
+ A read-write single value file which exists on non-root cgroups.
+ The default value is "0".
+
+ Some networking protocols (e.g., TCP, UDP) implement their own memory
+ accounting for socket buffers.
+
+ This memory is also charged to a non-root cgroup as sock in memory.stat.
+
+ Since per-protocol limits such as /proc/sys/net/ipv4/tcp_mem and
+ /proc/sys/net/ipv4/udp_mem are global, memory allocation for socket
+ buffers may fail even when the cgroup has available memory.
+
+ Sockets created with socket_isolated set to 1 are no longer subject
+ to these global protocol limits.
+
Usage Guidelines
~~~~~~~~~~~~~~~~
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 211712ec57d1a..7d5d43e3b49e6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -226,6 +226,12 @@ struct mem_cgroup {
*/
bool oom_group;
+ /*
+ * If set, MEMCG_SOCK memory is charged on memcg only,
+ * otherwise, memcg and sk->sk_prot->memory_allocated.
+ */
+ bool socket_isolated;
+
int swappiness;
/* memory.events and memory.events.local */
diff --git a/include/net/sock.h b/include/net/sock.h
index 16fe0e5afc587..5e8c73731531c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2597,6 +2597,9 @@ static inline gfp_t gfp_memcg_charge(void)
}
#ifdef CONFIG_MEMCG
+
+#define MEMCG_SOCK_ISOLATED 1UL
+
static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
{
return sk->sk_memcg;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d7f4e31f4e625..0a55c12a6679b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4645,6 +4645,37 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
return nbytes;
}
+static int memory_socket_isolated_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+ seq_printf(m, "%d\n", READ_ONCE(memcg->socket_isolated));
+
+ return 0;
+}
+
+static ssize_t memory_socket_isolated_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ int ret, socket_isolated;
+
+ buf = strstrip(buf);
+ if (!buf)
+ return -EINVAL;
+
+ ret = kstrtoint(buf, 0, &socket_isolated);
+ if (ret)
+ return ret;
+
+ if (socket_isolated != 0 && socket_isolated != MEMCG_SOCK_ISOLATED)
+ return -EINVAL;
+
+ WRITE_ONCE(memcg->socket_isolated, socket_isolated);
+
+ return nbytes;
+}
+
static struct cftype memory_files[] = {
{
.name = "current",
@@ -4716,6 +4747,12 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NS_DELEGATABLE,
.write = memory_reclaim,
},
+ {
+ .name = "socket_isolated",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_socket_isolated_show,
+ .write = memory_socket_isolated_write,
+ },
{ } /* terminate */
};
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH v1 net-next 12/13] net-memcg: Store memcg->socket_isolated in sk->sk_memcg.
2025-07-21 20:35 [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
` (10 preceding siblings ...)
2025-07-21 20:35 ` [PATCH v1 net-next 11/13] net-memcg: Add memory.socket_isolated knob Kuniyuki Iwashima
@ 2025-07-21 20:35 ` Kuniyuki Iwashima
2025-07-22 15:02 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
2025-07-22 15:04 ` [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Shakeel Butt
13 siblings, 1 reply; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-21 20:35 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton
Cc: Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
memcg->socket_isolated can change at any time, so we must
snapshot the value for each socket to ensure consistency.
Given sk->sk_memcg can be accessed in the fast path, it would
be preferable to place the flag field in the same cache line
as sk->sk_memcg.
However, struct sock does not have such a 1-byte hole.
Let's store the flag in the lowest bit of sk->sk_memcg and
add a helper to check the bit. This is safe because the
pointer is at least word-aligned, so its lowest bit is always
clear in a valid pointer.
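A minimal sketch of the round-trip with the helpers below
(illustrative, not the actual call sites):

  /* mem_cgroup_sk_set() ORs the snapshot into bit 0 of the pointer. */
  sk->sk_memcg = (struct mem_cgroup *)((unsigned long)memcg |
                                       MEMCG_SOCK_ISOLATED);

  mem_cgroup_from_sk(sk);     /* returns memcg, flag masked off */
  mem_cgroup_sk_isolated(sk); /* returns true */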
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
include/net/sock.h | 20 +++++++++++++++++++-
mm/memcontrol.c | 13 +++++++++++--
2 files changed, 30 insertions(+), 3 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 5e8c73731531c..2e9d76fc2bf38 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2599,10 +2599,16 @@ static inline gfp_t gfp_memcg_charge(void)
#ifdef CONFIG_MEMCG
#define MEMCG_SOCK_ISOLATED 1UL
+#define MEMCG_SOCK_FLAG_MASK MEMCG_SOCK_ISOLATED
+#define MEMCG_SOCK_PTR_MASK ~(MEMCG_SOCK_FLAG_MASK)
static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
{
- return sk->sk_memcg;
+ unsigned long val = (unsigned long)sk->sk_memcg;
+
+ val &= MEMCG_SOCK_PTR_MASK;
+
+ return (struct mem_cgroup *)val;
}
static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
@@ -2610,6 +2616,13 @@ static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
return mem_cgroup_sockets_enabled && mem_cgroup_from_sk(sk);
}
+static inline bool mem_cgroup_sk_isolated(const struct sock *sk)
+{
+ struct mem_cgroup *memcg = sk->sk_memcg;
+
+ return (unsigned long)memcg & MEMCG_SOCK_ISOLATED;
+}
+
static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
{
struct mem_cgroup *memcg = mem_cgroup_from_sk(sk);
@@ -2636,6 +2649,11 @@ static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
return false;
}
+static inline bool mem_cgroup_sk_isolated(const struct sock *sk)
+{
+ return false;
+}
+
static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
{
return false;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0a55c12a6679b..85decc4319f96 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5098,6 +5098,15 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
EXPORT_SYMBOL(memcg_sockets_enabled_key);
+static void mem_cgroup_sk_set(struct sock *sk, const struct mem_cgroup *memcg)
+{
+ unsigned long val = (unsigned long)memcg;
+
+ val |= READ_ONCE(memcg->socket_isolated);
+
+ sk->sk_memcg = (struct mem_cgroup *)val;
+}
+
void mem_cgroup_sk_alloc(struct sock *sk)
{
struct mem_cgroup *memcg;
@@ -5116,7 +5125,7 @@ void mem_cgroup_sk_alloc(struct sock *sk)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !memcg1_tcpmem_active(memcg))
goto out;
if (css_tryget(&memcg->css))
- sk->sk_memcg = memcg;
+ mem_cgroup_sk_set(sk, memcg);
out:
rcu_read_unlock();
}
@@ -5138,7 +5147,7 @@ void mem_cgroup_sk_inherit(const struct sock *sk, struct sock *newsk)
mem_cgroup_sk_free(newsk);
css_get(&memcg->css);
- newsk->sk_memcg = memcg;
+ mem_cgroup_sk_set(newsk, memcg);
}
/**
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-21 20:35 [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
` (11 preceding siblings ...)
2025-07-21 20:35 ` [PATCH v1 net-next 12/13] net-memcg: Store memcg->socket_isolated in sk->sk_memcg Kuniyuki Iwashima
@ 2025-07-21 20:35 ` Kuniyuki Iwashima
2025-07-22 15:14 ` Shakeel Butt
` (3 more replies)
2025-07-22 15:04 ` [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Shakeel Butt
13 siblings, 4 replies; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-21 20:35 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton
Cc: Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
Some protocols (e.g., TCP, UDP) implement memory accounting for socket
buffers and charge memory to per-protocol global counters pointed to by
sk->sk_prot->memory_allocated.
When running under a non-root cgroup, this memory is also charged to the
memcg as sock in memory.stat.
Even when memory usage is controlled by memcg, sockets using such protocols
are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
This makes it difficult to accurately estimate and configure appropriate
global limits, especially in multi-tenant environments.
If all workloads were guaranteed to be controlled under memcg, the issue
could be worked around by setting tcp_mem[0~2] to UINT_MAX.
In reality, this assumption does not always hold, and a single workload
that opts out of memcg can consume memory up to the global limit,
becoming a noisy neighbour.
Let's decouple memcg from the global per-protocol memory accounting.
This simplifies memcg configuration while keeping the global limits
within a reasonable range.
If mem_cgroup_sk_isolated(sk) returns true, the per-protocol memory
accounting is skipped.
In inet_csk_accept(), we need to reclaim counts that are already charged
for child sockets because we do not allocate sk->sk_memcg until accept().
Note that trace_sock_exceed_buf_limit() will always show 0 as allocated
for the isolated sockets, but their usage can be obtained via memory.stat.
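In short, the charge path in __sk_mem_raise_allocated() becomes
(sketch; see the diff below for the full logic):

  if (mem_cgroup_sk_enabled(sk)) {
          charged = mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge());

          /* Isolated and charged: skip the global counters entirely. */
          if (mem_cgroup_sk_isolated(sk) && charged)
                  return 1;
          ...
  }

  /* Non-isolated sockets (and the !memcg case) still go through the
   * per-protocol global counters.
   */
  sk_memory_allocated_add(sk, amt);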
Tested with a script that creates local socket pairs and send()s a
bunch of data without recv()ing.
Setup:
# mkdir /sys/fs/cgroup/test
# echo $$ >> /sys/fs/cgroup/test/cgroup.procs
# sysctl -q net.ipv4.tcp_mem="1000 1000 1000"
Without memory.socket_isolated:
# echo 0 > /sys/fs/cgroup/test/memory.socket_isolated
# prlimit -n=524288:524288 bash -c "python3 pressure.py" &
# cat /sys/fs/cgroup/test/memory.stat | grep sock
sock 24682496
# ss -tn | head -n 5
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 2000 0 127.0.0.1:54997 127.0.0.1:37738
ESTAB 2000 0 127.0.0.1:54997 127.0.0.1:60122
ESTAB 2000 0 127.0.0.1:54997 127.0.0.1:33622
ESTAB 2000 0 127.0.0.1:54997 127.0.0.1:35042
# nstat | grep Pressure || echo no pressure
TcpExtTCPMemoryPressures 1 0.0
With memory.socket_isolated:
# echo 1 > /sys/fs/cgroup/test/memory.socket_isolated
# prlimit -n=524288:524288 bash -c "python3 pressure.py" &
# cat /sys/fs/cgroup/test/memory.stat | grep sock
sock 2766671872
# ss -tn | head -n 5
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 112000 0 127.0.0.1:41729 127.0.0.1:35062
ESTAB 110000 0 127.0.0.1:41729 127.0.0.1:36288
ESTAB 112000 0 127.0.0.1:41729 127.0.0.1:37560
ESTAB 112000 0 127.0.0.1:41729 127.0.0.1:37096
# nstat | grep Pressure || echo no pressure
no pressure
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
include/net/proto_memory.h | 10 +++--
include/net/tcp.h | 10 +++--
net/core/sock.c | 65 +++++++++++++++++++++++----------
net/ipv4/inet_connection_sock.c | 18 +++++++--
net/ipv4/tcp_output.c | 10 ++++-
5 files changed, 82 insertions(+), 31 deletions(-)
diff --git a/include/net/proto_memory.h b/include/net/proto_memory.h
index 8e91a8fa31b52..3c2e92f5a6866 100644
--- a/include/net/proto_memory.h
+++ b/include/net/proto_memory.h
@@ -31,9 +31,13 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
if (!sk->sk_prot->memory_pressure)
return false;
- if (mem_cgroup_sk_enabled(sk) &&
- mem_cgroup_sk_under_memory_pressure(sk))
- return true;
+ if (mem_cgroup_sk_enabled(sk)) {
+ if (mem_cgroup_sk_under_memory_pressure(sk))
+ return true;
+
+ if (mem_cgroup_sk_isolated(sk))
+ return false;
+ }
return !!READ_ONCE(*sk->sk_prot->memory_pressure);
}
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9ffe971a1856b..a5ff82a59867b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -275,9 +275,13 @@ extern unsigned long tcp_memory_pressure;
/* optimized version of sk_under_memory_pressure() for TCP sockets */
static inline bool tcp_under_memory_pressure(const struct sock *sk)
{
- if (mem_cgroup_sk_enabled(sk) &&
- mem_cgroup_sk_under_memory_pressure(sk))
- return true;
+ if (mem_cgroup_sk_enabled(sk)) {
+ if (mem_cgroup_sk_under_memory_pressure(sk))
+ return true;
+
+ if (mem_cgroup_sk_isolated(sk))
+ return false;
+ }
return READ_ONCE(tcp_memory_pressure);
}
diff --git a/net/core/sock.c b/net/core/sock.c
index ab6953d295dfa..e1ae6d03b8227 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1046,17 +1046,21 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
 	if (!charged)
 		return -ENOMEM;
 
-	/* pre-charge to forward_alloc */
-	sk_memory_allocated_add(sk, pages);
-	allocated = sk_memory_allocated(sk);
-	/* If the system goes into memory pressure with this
-	 * precharge, give up and return error.
-	 */
-	if (allocated > sk_prot_mem_limits(sk, 1)) {
-		sk_memory_allocated_sub(sk, pages);
-		mem_cgroup_sk_uncharge(sk, pages);
-		return -ENOMEM;
+	if (!mem_cgroup_sk_isolated(sk)) {
+		/* pre-charge to forward_alloc */
+		sk_memory_allocated_add(sk, pages);
+		allocated = sk_memory_allocated(sk);
+
+		/* If the system goes into memory pressure with this
+		 * precharge, give up and return error.
+		 */
+		if (allocated > sk_prot_mem_limits(sk, 1)) {
+			sk_memory_allocated_sub(sk, pages);
+			mem_cgroup_sk_uncharge(sk, pages);
+			return -ENOMEM;
+		}
 	}
 
 	sk_forward_alloc_add(sk, pages << PAGE_SHIFT);
 
 	WRITE_ONCE(sk->sk_reserved_mem,
@@ -3153,8 +3157,12 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 	if (likely(skb_page_frag_refill(32U, pfrag, sk->sk_allocation)))
 		return true;
 
-	sk_enter_memory_pressure(sk);
 	sk_stream_moderate_sndbuf(sk);
+
+	if (mem_cgroup_sk_enabled(sk) && mem_cgroup_sk_isolated(sk))
+		return false;
+
+	sk_enter_memory_pressure(sk);
 	return false;
 }
 EXPORT_SYMBOL(sk_page_frag_refill);
@@ -3267,18 +3275,30 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
 	bool memcg_enabled = false, charged = false;
 	struct proto *prot = sk->sk_prot;
-	long allocated;
-
-	sk_memory_allocated_add(sk, amt);
-	allocated = sk_memory_allocated(sk);
+	long allocated = 0;
 
 	if (mem_cgroup_sk_enabled(sk)) {
+		bool isolated = mem_cgroup_sk_isolated(sk);
+
 		memcg_enabled = true;
 		charged = mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge());
-		if (!charged)
+
+		if (isolated && charged)
+			return 1;
+
+		if (!charged) {
+			if (!isolated) {
+				sk_memory_allocated_add(sk, amt);
+				allocated = sk_memory_allocated(sk);
+			}
+
 			goto suppress_allocation;
+		}
 	}
 
+	sk_memory_allocated_add(sk, amt);
+	allocated = sk_memory_allocated(sk);
+
 	/* Under limit. */
 	if (allocated <= sk_prot_mem_limits(sk, 0)) {
 		sk_leave_memory_pressure(sk);
@@ -3357,7 +3377,8 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 	trace_sock_exceed_buf_limit(sk, prot, allocated, kind);
 
-	sk_memory_allocated_sub(sk, amt);
+	if (allocated)
+		sk_memory_allocated_sub(sk, amt);
 
 	if (charged)
 		mem_cgroup_sk_uncharge(sk, amt);
@@ -3396,11 +3417,15 @@ EXPORT_SYMBOL(__sk_mem_schedule);
  */
 void __sk_mem_reduce_allocated(struct sock *sk, int amount)
 {
-	sk_memory_allocated_sub(sk, amount);
-
-	if (mem_cgroup_sk_enabled(sk))
+	if (mem_cgroup_sk_enabled(sk)) {
 		mem_cgroup_sk_uncharge(sk, amount);
 
+		if (mem_cgroup_sk_isolated(sk))
+			return;
+	}
+
+	sk_memory_allocated_sub(sk, amount);
+
 	if (sk_under_global_memory_pressure(sk) &&
 	    (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
 		sk_leave_memory_pressure(sk);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 0ef1eacd539d1..9d56085f7f54b 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -22,6 +22,7 @@
 #include <net/tcp.h>
 #include <net/sock_reuseport.h>
 #include <net/addrconf.h>
+#include <net/proto_memory.h>
 
 #if IS_ENABLED(CONFIG_IPV6)
 /* match_sk*_wildcard == true: IPV6_ADDR_ANY equals to any IPv6 addresses
@@ -710,7 +711,6 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
 	if (mem_cgroup_sockets_enabled) {
 		gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
-		int amt = 0;
 
 		/* atomically get the memory usage, set and charge the
 		 * newsk->sk_memcg.
@@ -719,15 +719,27 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
 		mem_cgroup_sk_alloc(newsk);
 		if (mem_cgroup_from_sk(newsk)) {
+			int amt;
+
 			/* The socket has not been accepted yet, no need
 			 * to look at newsk->sk_wmem_queued.
 			 */
 			amt = sk_mem_pages(newsk->sk_forward_alloc +
 					   atomic_read(&newsk->sk_rmem_alloc));
+			if (amt) {
+				/* This amt is already charged globally to
+				 * sk_prot->memory_allocated due to lack of
+				 * sk_memcg until accept(), thus we need to
+				 * reclaim it here if newsk is isolated.
+				 */
+				if (mem_cgroup_sk_isolated(newsk))
+					sk_memory_allocated_sub(newsk, amt);
+
+				mem_cgroup_sk_charge(newsk, amt, gfp);
+			}
 		}
 
-		if (amt)
-			mem_cgroup_sk_charge(newsk, amt, gfp);
 		kmem_cache_charge(newsk, gfp);
 		release_sock(newsk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 09f0802f36afa..79e705fca8b67 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3562,12 +3562,18 @@ void sk_forced_mem_schedule(struct sock *sk, int size)
 	delta = size - sk->sk_forward_alloc;
 	if (delta <= 0)
 		return;
+
 	amt = sk_mem_pages(delta);
 	sk_forward_alloc_add(sk, amt << PAGE_SHIFT);
-	sk_memory_allocated_add(sk, amt);
 
-	if (mem_cgroup_sk_enabled(sk))
+	if (mem_cgroup_sk_enabled(sk)) {
 		mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge() | __GFP_NOFAIL);
+
+		if (mem_cgroup_sk_isolated(sk))
+			return;
+	}
+
+	sk_memory_allocated_add(sk, amt);
 }
 
 /* Send a FIN. The caller locks the socket for us.
--
2.50.0.727.gbf7dc18ff4-goog
* Re: [PATCH v1 net-next 01/13] mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n.
2025-07-21 20:35 ` [PATCH v1 net-next 01/13] mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n Kuniyuki Iwashima
@ 2025-07-22 14:30 ` Eric Dumazet
0 siblings, 0 replies; 52+ messages in thread
From: Eric Dumazet @ 2025-07-22 14:30 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Jakub Kicinski, Neal Cardwell, Paolo Abeni,
Willem de Bruijn, Matthieu Baerts, Mat Martineau, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Mon, Jul 21, 2025 at 1:36 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> When sk_alloc() allocates a socket, mem_cgroup_sk_alloc() sets
> sk->sk_memcg based on the current task.
>
> MPTCP subflow socket creation is triggered from userspace or
> an in-kernel worker.
>
> In the latter case, sk->sk_memcg is not what we want. So, we fix
> it up from the parent socket's sk->sk_memcg in mptcp_attach_cgroup().
>
> Although the code is placed under #ifdef CONFIG_MEMCG, it is buried
> under #ifdef CONFIG_SOCK_CGROUP_DATA.
>
> The two configs are orthogonal. If CONFIG_MEMCG is enabled without
> CONFIG_SOCK_CGROUP_DATA, the subflow's memory usage is not charged
> correctly.
>
> Let's move the code out of the wrong ifdef guard.
>
> Note that sk->sk_memcg is freed in sk_prot_free() and the parent
> sk holds the refcnt of memcg->css here, so we don't need to use
> css_tryget().
>
> Fixes: 3764b0c5651e3 ("mptcp: attach subflow socket to parent cgroup")
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH v1 net-next 02/13] mptcp: Use tcp_under_memory_pressure() in mptcp_epollin_ready().
2025-07-21 20:35 ` [PATCH v1 net-next 02/13] mptcp: Use tcp_under_memory_pressure() in mptcp_epollin_ready() Kuniyuki Iwashima
@ 2025-07-22 14:33 ` Eric Dumazet
0 siblings, 0 replies; 52+ messages in thread
From: Eric Dumazet @ 2025-07-22 14:33 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Jakub Kicinski, Neal Cardwell, Paolo Abeni,
Willem de Bruijn, Matthieu Baerts, Mat Martineau, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Mon, Jul 21, 2025 at 1:36 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> Some conditions used in mptcp_epollin_ready() are the same as
> tcp_under_memory_pressure().
>
> We will modify tcp_under_memory_pressure() in the later patch.
>
> Let's use tcp_under_memory_pressure() instead.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH v1 net-next 03/13] tcp: Simplify error path in inet_csk_accept().
2025-07-21 20:35 ` [PATCH v1 net-next 03/13] tcp: Simplify error path in inet_csk_accept() Kuniyuki Iwashima
@ 2025-07-22 14:34 ` Eric Dumazet
0 siblings, 0 replies; 52+ messages in thread
From: Eric Dumazet @ 2025-07-22 14:34 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Jakub Kicinski, Neal Cardwell, Paolo Abeni,
Willem de Bruijn, Matthieu Baerts, Mat Martineau, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Mon, Jul 21, 2025 at 1:36 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> When an error occurs in inet_csk_accept(), what we should do is
> only call release_sock() and set the errno to arg->err.
>
> But the path jumps to another label, which introduces unnecessary
> initialisation and tests for newsk.
>
> Let's simplify the error path and remove the redundant NULL
> checks for newsk.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH v1 net-next 04/13] net: Call trace_sock_exceed_buf_limit() for memcg failure with SK_MEM_RECV.
2025-07-21 20:35 ` [PATCH v1 net-next 04/13] net: Call trace_sock_exceed_buf_limit() for memcg failure with SK_MEM_RECV Kuniyuki Iwashima
@ 2025-07-22 14:37 ` Eric Dumazet
0 siblings, 0 replies; 52+ messages in thread
From: Eric Dumazet @ 2025-07-22 14:37 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Jakub Kicinski, Neal Cardwell, Paolo Abeni,
Willem de Bruijn, Matthieu Baerts, Mat Martineau, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Mon, Jul 21, 2025 at 1:36 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> Initially, trace_sock_exceed_buf_limit() was invoked when
> __sk_mem_raise_allocated() failed due to the memcg limit or the
> global limit.
>
> However, commit d6f19938eb031 ("net: expose sk wmem in
> sock_exceed_buf_limit tracepoint") somehow suppressed the event
> only when memcg failed to charge for SK_MEM_RECV, although the
> memcg failure for SK_MEM_SEND still triggers the event.
>
> Let's restore the event for SK_MEM_RECV.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH v1 net-next 05/13] net: Clean up __sk_mem_raise_allocated().
2025-07-21 20:35 ` [PATCH v1 net-next 05/13] net: Clean up __sk_mem_raise_allocated() Kuniyuki Iwashima
@ 2025-07-22 14:38 ` Eric Dumazet
0 siblings, 0 replies; 52+ messages in thread
From: Eric Dumazet @ 2025-07-22 14:38 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Jakub Kicinski, Neal Cardwell, Paolo Abeni,
Willem de Bruijn, Matthieu Baerts, Mat Martineau, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Mon, Jul 21, 2025 at 1:36 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> In __sk_mem_raise_allocated(), charged is initialised as true due
> to the weird condition removed in the previous patch.
>
> It makes the variable unreliable by itself, so we have to check
> another variable, memcg, in advance.
>
> Also, we will factorise the common check below for memcg later.
>
> if (mem_cgroup_sockets_enabled && sk->sk_memcg)
>
> As a prep, let's initialise charged as false and memcg as NULL.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH v1 net-next 06/13] net-memcg: Introduce mem_cgroup_from_sk().
2025-07-21 20:35 ` [PATCH v1 net-next 06/13] net-memcg: Introduce mem_cgroup_from_sk() Kuniyuki Iwashima
@ 2025-07-22 14:39 ` Eric Dumazet
0 siblings, 0 replies; 52+ messages in thread
From: Eric Dumazet @ 2025-07-22 14:39 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Jakub Kicinski, Neal Cardwell, Paolo Abeni,
Willem de Bruijn, Matthieu Baerts, Mat Martineau, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Mon, Jul 21, 2025 at 1:36 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> We will store a flag in the lowest bit of sk->sk_memcg.
>
> Then, directly dereferencing sk->sk_memcg will be illegal, and we
> do not want to allow touching the raw sk->sk_memcg in many places.
>
> Let's introduce mem_cgroup_from_sk().
>
> Other places accessing the raw sk->sk_memcg will be converted later.
>
> Note that we cannot define the helper as an inline function in
> memcontrol.h as we cannot access any fields of struct sock there
> due to circular dependency, so it is placed in sock.h.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH v1 net-next 07/13] net-memcg: Introduce mem_cgroup_sk_enabled().
2025-07-21 20:35 ` [PATCH v1 net-next 07/13] net-memcg: Introduce mem_cgroup_sk_enabled() Kuniyuki Iwashima
@ 2025-07-22 14:40 ` Eric Dumazet
0 siblings, 0 replies; 52+ messages in thread
From: Eric Dumazet @ 2025-07-22 14:40 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Jakub Kicinski, Neal Cardwell, Paolo Abeni,
Willem de Bruijn, Matthieu Baerts, Mat Martineau, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Mon, Jul 21, 2025 at 1:36 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> The socket memcg feature is enabled by a static key and
> only works for non-root cgroup.
>
> We check both conditions in many places.
>
> Let's factorise it as a helper function.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH v1 net-next 08/13] net-memcg: Pass struct sock to mem_cgroup_sk_(un)?charge().
2025-07-21 20:35 ` [PATCH v1 net-next 08/13] net-memcg: Pass struct sock to mem_cgroup_sk_(un)?charge() Kuniyuki Iwashima
@ 2025-07-22 14:56 ` Eric Dumazet
0 siblings, 0 replies; 52+ messages in thread
From: Eric Dumazet @ 2025-07-22 14:56 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Jakub Kicinski, Neal Cardwell, Paolo Abeni,
Willem de Bruijn, Matthieu Baerts, Mat Martineau, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Mon, Jul 21, 2025 at 1:36 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> We will store a flag in the lowest bit of sk->sk_memcg.
>
> Then, we cannot pass the raw pointer to mem_cgroup_charge_skmem()
> and mem_cgroup_uncharge_skmem().
>
> Let's pass struct sock to the functions.
>
> While at it, they are renamed to match other functions starting
> with mem_cgroup_sk_.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> ---
Reviewed-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH v1 net-next 09/13] net-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure().
2025-07-21 20:35 ` [PATCH v1 net-next 09/13] net-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure() Kuniyuki Iwashima
@ 2025-07-22 14:58 ` Eric Dumazet
0 siblings, 0 replies; 52+ messages in thread
From: Eric Dumazet @ 2025-07-22 14:58 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Jakub Kicinski, Neal Cardwell, Paolo Abeni,
Willem de Bruijn, Matthieu Baerts, Mat Martineau, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Mon, Jul 21, 2025 at 1:36 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> We will store a flag in the lowest bit of sk->sk_memcg.
>
> Then, we cannot pass the raw pointer to mem_cgroup_under_socket_pressure().
>
> Let's pass struct sock to it and rename the function to match other
> functions starting with mem_cgroup_sk_.
>
> Note that the helper is moved to sock.h to use mem_cgroup_from_sk().
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH v1 net-next 10/13] net: Define sk_memcg under CONFIG_MEMCG.
2025-07-21 20:35 ` [PATCH v1 net-next 10/13] net: Define sk_memcg under CONFIG_MEMCG Kuniyuki Iwashima
@ 2025-07-22 14:58 ` Eric Dumazet
0 siblings, 0 replies; 52+ messages in thread
From: Eric Dumazet @ 2025-07-22 14:58 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Jakub Kicinski, Neal Cardwell, Paolo Abeni,
Willem de Bruijn, Matthieu Baerts, Mat Martineau, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Mon, Jul 21, 2025 at 1:36 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> Except for sk_clone_lock(), all accesses to sk->sk_memcg
> is done under CONFIG_MEMCG.
>
> As a bonus, let's define sk->sk_memcg under CONFIG_MEMCG.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH v1 net-next 11/13] net-memcg: Add memory.socket_isolated knob.
2025-07-21 20:35 ` [PATCH v1 net-next 11/13] net-memcg: Add memory.socket_isolated knob Kuniyuki Iwashima
@ 2025-07-22 15:00 ` Eric Dumazet
2025-07-31 13:39 ` Michal Koutný
1 sibling, 0 replies; 52+ messages in thread
From: Eric Dumazet @ 2025-07-22 15:00 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Jakub Kicinski, Neal Cardwell, Paolo Abeni,
Willem de Bruijn, Matthieu Baerts, Mat Martineau, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Mon, Jul 21, 2025 at 1:36 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> Some networking protocols have their own global memory accounting,
> and such memory is also charged to memcg as sock in memory.stat.
>
> Such sockets are subject to the global limit, thus affected by a
> noisy neighbour outside the cgroup.
>
> We will decouple the global memory accounting if configured.
>
> Let's add a per-memcg knob to control that.
>
> The value will be saved in each socket when created and will
> persist through the socket's lifetime.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> ---
Reviewed-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH v1 net-next 12/13] net-memcg: Store memcg->socket_isolated in sk->sk_memcg.
2025-07-21 20:35 ` [PATCH v1 net-next 12/13] net-memcg: Store memcg->socket_isolated in sk->sk_memcg Kuniyuki Iwashima
@ 2025-07-22 15:02 ` Eric Dumazet
0 siblings, 0 replies; 52+ messages in thread
From: Eric Dumazet @ 2025-07-22 15:02 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Jakub Kicinski, Neal Cardwell, Paolo Abeni,
Willem de Bruijn, Matthieu Baerts, Mat Martineau, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Mon, Jul 21, 2025 at 1:36 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> memcg->socket_isolated can change at any time, so we must
> snapshot the value for each socket to ensure consistency.
>
> Given sk->sk_memcg can be accessed in the fast path, it would
> be preferable to place the flag field in the same cache line
> as sk->sk_memcg.
>
> However, struct sock does not have such a 1-byte hole.
>
> Let's store the flag in the lowest bit of sk->sk_memcg and
> add a helper to check the bit.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
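The quoted commit message describes the classic tagged-pointer trick. A
self-contained userspace sketch of the idea, with made-up names rather
than the patch's actual helpers:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define ISOLATED_BIT	1UL	/* usable because the pointee is at least 2-byte aligned */

struct memcg_demo { long pad; };	/* stand-in for struct mem_cgroup */

static struct memcg_demo *demo_ptr(uintptr_t raw)
{
	return (struct memcg_demo *)(raw & ~ISOLATED_BIT);	/* mask the flag off */
}

static int demo_isolated(uintptr_t raw)
{
	return raw & ISOLATED_BIT;	/* read the flag without touching the pointer */
}

int main(void)
{
	struct memcg_demo *memcg = malloc(sizeof(*memcg));
	uintptr_t raw;

	assert(((uintptr_t)memcg & ISOLATED_BIT) == 0);	/* malloc() is suitably aligned */

	raw = (uintptr_t)memcg | ISOLATED_BIT;	/* snapshot the knob at socket creation */

	printf("memcg=%p isolated=%d\n", (void *)demo_ptr(raw), demo_isolated(raw));
	free(memcg);
	return 0;
}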
* Re: [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated.
2025-07-21 20:35 [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
` (12 preceding siblings ...)
2025-07-21 20:35 ` [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
@ 2025-07-22 15:04 ` Shakeel Butt
2025-07-22 15:34 ` Eric Dumazet
13 siblings, 1 reply; 52+ messages in thread
From: Shakeel Butt @ 2025-07-22 15:04 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Mon, Jul 21, 2025 at 08:35:19PM +0000, Kuniyuki Iwashima wrote:
> Some protocols (e.g., TCP, UDP) has their own memory accounting for
> socket buffers and charge memory to global per-protocol counters such
> as /proc/net/ipv4/tcp_mem.
>
> When running under a non-root cgroup, this memory is also charged to
> the memcg as sock in memory.stat.
>
> Sockets using such protocols are still subject to the global limits,
> thus affected by a noisy neighbour outside cgroup.
>
> This makes it difficult to accurately estimate and configure appropriate
> global limits.
>
> If all workloads were guaranteed to be controlled under memcg, the issue
> can be worked around by setting tcp_mem[0~2] to UINT_MAX.
>
> However, this assumption does not always hold, and a single workload that
> opts out of memcg can consume memory up to the global limit, which is
> problematic.
>
> This series introduces a new per-memcg know to allow decoupling memcg
> from the global memory accounting, which simplifies the memcg
> configuration while keeping the global limits within a reasonable range.
Sorry, the above paragraph is confusing. What is a per-memcg "know"? Or
maybe it is "knob". Also, please go into a bit more detail on how the
decoupling helps keep the global limits within a reasonable range.
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-21 20:35 ` [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
@ 2025-07-22 15:14 ` Shakeel Butt
2025-07-22 15:24 ` Eric Dumazet
2025-07-28 16:07 ` Johannes Weiner
` (2 subsequent siblings)
3 siblings, 1 reply; 52+ messages in thread
From: Shakeel Butt @ 2025-07-22 15:14 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote:
> Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> buffers and charge memory to per-protocol global counters pointed to by
> sk->sk_proto->memory_allocated.
>
> When running under a non-root cgroup, this memory is also charged to the
> memcg as sock in memory.stat.
>
> Even when memory usage is controlled by memcg, sockets using such protocols
> are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
>
> This makes it difficult to accurately estimate and configure appropriate
> global limits, especially in multi-tenant environments.
>
> If all workloads were guaranteed to be controlled under memcg, the issue
> could be worked around by setting tcp_mem[0~2] to UINT_MAX.
>
> In reality, this assumption does not always hold, and a single workload
> that opts out of memcg can consume memory up to the global limit,
> becoming a noisy neighbour.
>
Sorry, but the above is not reasonable. On a multi-tenant system, no
workload should be able to opt out of memcg accounting if isolation is
needed. If a workload can opt out, then there is no guarantee.
In addition, please avoid adding a per-memcg knob. Why not have a
system-level setting for the decoupling? I would say start with a
build-time config setting or boot parameter; then, if really needed, we
can discuss a system-level setting that can be toggled at runtime,
though there might be challenges there.
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-22 15:14 ` Shakeel Butt
@ 2025-07-22 15:24 ` Eric Dumazet
2025-07-22 15:52 ` Shakeel Butt
0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2025-07-22 15:24 UTC (permalink / raw)
To: Shakeel Butt
Cc: Kuniyuki Iwashima, David S. Miller, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Tue, Jul 22, 2025 at 8:14 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote:
> > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > buffers and charge memory to per-protocol global counters pointed to by
> > sk->sk_proto->memory_allocated.
> >
> > When running under a non-root cgroup, this memory is also charged to the
> > memcg as sock in memory.stat.
> >
> > Even when memory usage is controlled by memcg, sockets using such protocols
> > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> >
> > This makes it difficult to accurately estimate and configure appropriate
> > global limits, especially in multi-tenant environments.
> >
> > If all workloads were guaranteed to be controlled under memcg, the issue
> > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> >
> > In reality, this assumption does not always hold, and a single workload
> > that opts out of memcg can consume memory up to the global limit,
> > becoming a noisy neighbour.
> >
>
> Sorry but the above is not reasonable. On a multi-tenant system no
> workload should be able to opt out of memcg accounting if isolation is
> needed. If a workload can opt out then there is no guarantee.
Deployment issue ?
In a multi-tenant system you can not suddenly force all workloads to
be TCP memcg charged. This has caused many OOMs.
Also, the current situation of maintaining two limits (memcg one, plus
global tcp_memory_allocated) is very inefficient.
If we trust memcg, then why have an expensive safety belt ?
With this series, we can finally use one or the other limit. This
should have been done from day-0 really.
>
> In addition please avoid adding a per-memcg knob. Why not have system
> level setting for the decoupling. I would say start with a build time
> config setting or boot parameter then if really needed we can discuss if
> system level setting is needed which can be toggled at runtime though
> there might be challenges there.
Build time or boot parameter ? I fail to see how it can be more convenient.
* Re: [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated.
2025-07-22 15:04 ` [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Shakeel Butt
@ 2025-07-22 15:34 ` Eric Dumazet
0 siblings, 0 replies; 52+ messages in thread
From: Eric Dumazet @ 2025-07-22 15:34 UTC (permalink / raw)
To: Shakeel Butt
Cc: Kuniyuki Iwashima, David S. Miller, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Tue, Jul 22, 2025 at 8:04 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Jul 21, 2025 at 08:35:19PM +0000, Kuniyuki Iwashima wrote:
> > Some protocols (e.g., TCP, UDP) has their own memory accounting for
> > socket buffers and charge memory to global per-protocol counters such
> > as /proc/net/ipv4/tcp_mem.
> >
> > When running under a non-root cgroup, this memory is also charged to
> > the memcg as sock in memory.stat.
> >
> > Sockets using such protocols are still subject to the global limits,
> > thus affected by a noisy neighbour outside cgroup.
> >
> > This makes it difficult to accurately estimate and configure appropriate
> > global limits.
> >
> > If all workloads were guaranteed to be controlled under memcg, the issue
> > can be worked around by setting tcp_mem[0~2] to UINT_MAX.
> >
> > However, this assumption does not always hold, and a single workload that
> > opts out of memcg can consume memory up to the global limit, which is
> > problematic.
> >
> > This series introduces a new per-memcg know to allow decoupling memcg
> > from the global memory accounting, which simplifies the memcg
> > configuration while keeping the global limits within a reasonable range.
>
> Sorry, the above para is confusing. What is per-memcg know? Or maybe it
> is knob. Also please go a bit in more detail how decoupling helps the
> global limits within a reasonable range?
The intent is to no longer have to increase tcp_mem[0..2] just to
allow a big job to use 90% of physical memory all for TCP sockets and
buffers.
Leave the Linux default values. They have been considered reasonable
for decades.
They will only be used by applications not using memcg to limit TCP
memory usage.
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-22 15:24 ` Eric Dumazet
@ 2025-07-22 15:52 ` Shakeel Butt
2025-07-22 18:18 ` Kuniyuki Iwashima
0 siblings, 1 reply; 52+ messages in thread
From: Shakeel Butt @ 2025-07-22 15:52 UTC (permalink / raw)
To: Eric Dumazet
Cc: Kuniyuki Iwashima, David S. Miller, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Tue, Jul 22, 2025 at 08:24:23AM -0700, Eric Dumazet wrote:
> On Tue, Jul 22, 2025 at 8:14 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote:
> > > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > > buffers and charge memory to per-protocol global counters pointed to by
> > > sk->sk_proto->memory_allocated.
> > >
> > > When running under a non-root cgroup, this memory is also charged to the
> > > memcg as sock in memory.stat.
> > >
> > > Even when memory usage is controlled by memcg, sockets using such protocols
> > > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> > >
> > > This makes it difficult to accurately estimate and configure appropriate
> > > global limits, especially in multi-tenant environments.
> > >
> > > If all workloads were guaranteed to be controlled under memcg, the issue
> > > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> > >
> > > In reality, this assumption does not always hold, and a single workload
> > > that opts out of memcg can consume memory up to the global limit,
> > > becoming a noisy neighbour.
> > >
> >
> > Sorry but the above is not reasonable. On a multi-tenant system no
> > workload should be able to opt out of memcg accounting if isolation is
> > needed. If a workload can opt out then there is no guarantee.
>
> Deployment issue ?
>
> In a multi-tenant system you can not suddenly force all workloads to
> be TCP memcg charged. This has caused many OOMs.
Let's discuss the above at the end.
>
> Also, the current situation of maintaining two limits (memcg one, plus
> global tcp_memory_allocated) is very inefficient.
Agree.
>
> If we trust memcg, then why have an expensive safety belt ?
>
> With this series, we can finally use one or the other limit. This
> should have been done from day-0 really.
Same, I agree.
>
> >
> > In addition please avoid adding a per-memcg knob. Why not have system
> > level setting for the decoupling. I would say start with a build time
> > config setting or boot parameter then if really needed we can discuss if
> > system level setting is needed which can be toggled at runtime though
> > there might be challenges there.
>
> Build time or boot parameter ? I fail to see how it can be more convenient.
I think we agree on decoupling the global and memcg accounting of
network memory. I am still not clear on the need for a per-memcg knob.
From the earlier comment, it seems like you want a mix of jobs with
memcg-limited network memory accounting and jobs with global network
accounting running concurrently on a system. Is that correct?
I expect this state of jobs with different network accounting config
running concurrently is temporary while the migration from one to the
other is happening. Please correct me if I am wrong.
My main concern with the memcg knob is that it is permanent and
requires hierarchical semantics. There is no need to add a permanent
interface for a temporary need, and I don't see a clear hierarchical
semantic for this interface.
I am wondering if alternative approaches for per-workload settings
were explored, starting with BPF.
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-22 15:52 ` Shakeel Butt
@ 2025-07-22 18:18 ` Kuniyuki Iwashima
2025-07-22 18:47 ` Shakeel Butt
0 siblings, 1 reply; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-22 18:18 UTC (permalink / raw)
To: Shakeel Butt
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Tue, Jul 22, 2025 at 8:52 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Jul 22, 2025 at 08:24:23AM -0700, Eric Dumazet wrote:
> > On Tue, Jul 22, 2025 at 8:14 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote:
> > > > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > > > buffers and charge memory to per-protocol global counters pointed to by
> > > > sk->sk_proto->memory_allocated.
> > > >
> > > > When running under a non-root cgroup, this memory is also charged to the
> > > > memcg as sock in memory.stat.
> > > >
> > > > Even when memory usage is controlled by memcg, sockets using such protocols
> > > > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> > > >
> > > > This makes it difficult to accurately estimate and configure appropriate
> > > > global limits, especially in multi-tenant environments.
> > > >
> > > > If all workloads were guaranteed to be controlled under memcg, the issue
> > > > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> > > >
> > > > In reality, this assumption does not always hold, and a single workload
> > > > that opts out of memcg can consume memory up to the global limit,
> > > > becoming a noisy neighbour.
> > > >
> > >
> > > Sorry but the above is not reasonable. On a multi-tenant system no
> > > workload should be able to opt out of memcg accounting if isolation is
> > > needed. If a workload can opt out then there is no guarantee.
> >
> > Deployment issue ?
> >
> > In a multi-tenant system you can not suddenly force all workloads to
> > be TCP memcg charged. This has caused many OOMs.
>
> Let's discuss the above at the end.
>
> >
> > Also, the current situation of maintaining two limits (memcg one, plus
> > global tcp_memory_allocated) is very inefficient.
>
> Agree.
>
> >
> > If we trust memcg, then why have an expensive safety belt ?
> >
> > With this series, we can finally use one or the other limit. This
> > should have been done from day-0 really.
>
> Same, I agree.
>
> >
> > >
> > > In addition please avoid adding a per-memcg knob. Why not have system
> > > level setting for the decoupling. I would say start with a build time
> > > config setting or boot parameter then if really needed we can discuss if
> > > system level setting is needed which can be toggled at runtime though
> > > there might be challenges there.
> >
> > Build time or boot parameter ? I fail to see how it can be more convenient.
>
> I think we agree on decoupling the global and memcg accounting of
> network memory. I am still not clear on the need of per-memcg knob. From
> the earlier comment, it seems like you want mix of jobs with memcg
> limited network memory accounting and with global network accounting
> running concurrently on a system. Is that correct?
Correct.
>
> I expect this state of jobs with different network accounting config
> running concurrently is temporary while the migration from one to the other
> is happening. Please correct me if I am wrong.
We need to migrate workloads gradually, and the system-wide config
does not work at all. AFAIU, there are already years of effort spent
on the migration, but it's not yet completed at Google. So, I don't
think the need is temporary.
>
> My main concern with the memcg knob is that it is permanent and it
> requires a hierarchical semantics. No need to add a permanent interface
> for a temporary need and I don't see a clear hierarchical semantic for
> this interface.
I don't see the merit of having hierarchical semantics for this knob.
Regardless of this knob, hierarchical semantics are guaranteed
by the other knobs. I think such semantics for this knob just
complicate the code with no gain.
>
> I am wondering if alternative approaches for per-workload settings
> were explored, starting with BPF.
>
>
>
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-22 18:18 ` Kuniyuki Iwashima
@ 2025-07-22 18:47 ` Shakeel Butt
2025-07-22 19:03 ` Kuniyuki Iwashima
0 siblings, 1 reply; 52+ messages in thread
From: Shakeel Butt @ 2025-07-22 18:47 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Tue, Jul 22, 2025 at 11:18:40AM -0700, Kuniyuki Iwashima wrote:
> >
> > I expect this state of jobs with different network accounting config
> > running concurrently is temporary while the migration from one to the other
> > is happening. Please correct me if I am wrong.
>
> We need to migrate workload gradually and the system-wide config
> does not work at all. AFAIU, there are already years of effort spent
> on the migration but it's not yet completed at Google. So, I don't think
> the need is temporary.
>
From what I remember, shared borg had completely moved to memcg
accounting of network memory (with the sys container as an exception)
years ago. Did something change there?
> >
> > My main concern with the memcg knob is that it is permanent and it
> > requires a hierarchical semantics. No need to add a permanent interface
> > for a temporary need and I don't see a clear hierarchical semantic for
> > this interface.
>
> I don't see merits of having hierarchical semantics for this knob.
> Regardless of this knob, hierarchical semantics is guaranteed
> by other knobs. I think such semantics for this knob just complicates
> the code with no gain.
>
Cgroup interfaces are hierarchical and we want to keep it that way.
Putting non-hierarchical interfaces just makes configuration and setup
hard to reason about.
>
> >
> > I am wondering if alternative approches for per-workload settings are
> > explore starting with BPF.
> >
Any response on the above? Any alternative approaches explored?
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-22 18:47 ` Shakeel Butt
@ 2025-07-22 19:03 ` Kuniyuki Iwashima
2025-07-22 19:56 ` Shakeel Butt
0 siblings, 1 reply; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-22 19:03 UTC (permalink / raw)
To: Shakeel Butt
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Tue, Jul 22, 2025 at 11:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Jul 22, 2025 at 11:18:40AM -0700, Kuniyuki Iwashima wrote:
> > >
> > > I expect this state of jobs with different network accounting config
> > > running concurrently is temporary while the migration from one to the other
> > > is happening. Please correct me if I am wrong.
> >
> > We need to migrate workload gradually and the system-wide config
> > does not work at all. AFAIU, there are already years of effort spent
> > on the migration but it's not yet completed at Google. So, I don't think
> > the need is temporary.
> >
>
> From what I remembered shared borg had completely moved to memcg
> accounting of network memory (with sys container as an exception) years
> ago. Did something change there?
AFAICS, there are some workloads that opted out of memcg and
consumed too much TCP memory due to tcp_mem=UINT_MAX, triggering
OOMs and disrupting other workloads.
>
> > >
> > > My main concern with the memcg knob is that it is permanent and it
> > > requires a hierarchical semantics. No need to add a permanent interface
> > > for a temporary need and I don't see a clear hierarchical semantic for
> > > this interface.
> >
> > I don't see merits of having hierarchical semantics for this knob.
> > Regardless of this knob, hierarchical semantics is guaranteed
> > by other knobs. I think such semantics for this knob just complicates
> > the code with no gain.
> >
>
> Cgroup interfaces are hierarchical and we want to keep it that way.
> Putting non-hierarchical interfaces just makes configuration and setup
> hard to reason about.
Actually, I tried that way in the initial draft version, but even with
the parent's knob set to 1 and the child's set to 0, no harmful
scenario came to mind.
>
> >
> > >
> > > I am wondering if alternative approches for per-workload settings are
> > > explore starting with BPF.
> > >
>
> Any response on the above? Any alternative approaches explored?
Do you mean flagging each socket by BPF at a cgroup hook?
I think it's overkill and we don't need such fine granularity.
Also, it sounds way too hacky to use BPF to correct the weird
behaviour from day 0. We should have a more generic way to
control that. I know this functionality is helpful for some workloads
at Amazon as well.
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-22 19:03 ` Kuniyuki Iwashima
@ 2025-07-22 19:56 ` Shakeel Butt
2025-07-22 21:59 ` Kuniyuki Iwashima
0 siblings, 1 reply; 52+ messages in thread
From: Shakeel Butt @ 2025-07-22 19:56 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Tue, Jul 22, 2025 at 12:03:48PM -0700, Kuniyuki Iwashima wrote:
> On Tue, Jul 22, 2025 at 11:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Tue, Jul 22, 2025 at 11:18:40AM -0700, Kuniyuki Iwashima wrote:
> > > >
> > > > I expect this state of jobs with different network accounting config
> > > > running concurrently is temporary while the migration from one to the other
> > > > is happening. Please correct me if I am wrong.
> > >
> > > We need to migrate workload gradually and the system-wide config
> > > does not work at all. AFAIU, there are already years of effort spent
> > > on the migration but it's not yet completed at Google. So, I don't think
> > > the need is temporary.
> > >
> >
> > From what I remembered shared borg had completely moved to memcg
> > accounting of network memory (with sys container as an exception) years
> > ago. Did something change there?
>
> AFAICS, there are some workloads that opted out from memcg and
> consumed too much tcp memory due to tcp_mem=UINT_MAX, triggering
> OOM and disrupting other workloads.
>
What were the reasons behind opting out? We should fix those
instead of a permanent opt-out option.
> >
> > > >
> > > > My main concern with the memcg knob is that it is permanent and it
> > > > requires a hierarchical semantics. No need to add a permanent interface
> > > > for a temporary need and I don't see a clear hierarchical semantic for
> > > > this interface.
> > >
> > > I don't see merits of having hierarchical semantics for this knob.
> > > Regardless of this knob, hierarchical semantics is guaranteed
> > > by other knobs. I think such semantics for this knob just complicates
> > > the code with no gain.
> > >
> >
> > Cgroup interfaces are hierarchical and we want to keep it that way.
> > Putting non-hierarchical interfaces just makes configuration and setup
> > hard to reason about.
>
> Actually, I tried that way in the initial draft version, but even if the
> parent's knob is 1 and child one is 0, a harmful scenario didn't come
> to my mind.
>
It is not just about a harmful scenario but more about clear semantics.
Check the memory.zswap.writeback semantics.
>
> >
> > >
> > > >
> > > > I am wondering if alternative approches for per-workload settings are
> > > > explore starting with BPF.
> > > >
> >
> > Any response on the above? Any alternative approaches explored?
>
> Do you mean flagging each socket by BPF at cgroup hook ?
Not sure. Will it not be very similar to your current approach? Each
socket is associated with a memcg, and at the place where you need to
check which accounting method to use, just check that memcg setting in
BPF; you can cache the result in the socket as well.
>
> I think it's overkill and we don't need such finer granularity.
>
> Also it sounds way too hacky to use BPF to correct the weird
> behaviour from day0.
What weird behavior? The two accounting mechanisms? Yes, I agree, but
memcgs with different accounting mechanisms running concurrently is
also weird.
> We should have more generic way to
> control that. I know this functionality is helpful for some workloads
> at Amazon as well.
The reason I am against this permanent opt-out interface is that if we
add it, we will never fix the underlying issues blocking the full
conversion to memcg accounting of network memory. I am OK with some
temporary measure to allow the impacted workloads to opt out until the
underlying issue is fixed.
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-22 19:56 ` Shakeel Butt
@ 2025-07-22 21:59 ` Kuniyuki Iwashima
2025-07-23 0:29 ` Shakeel Butt
0 siblings, 1 reply; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-22 21:59 UTC (permalink / raw)
To: Shakeel Butt
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Tue, Jul 22, 2025 at 12:56 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Jul 22, 2025 at 12:03:48PM -0700, Kuniyuki Iwashima wrote:
> > On Tue, Jul 22, 2025 at 11:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Tue, Jul 22, 2025 at 11:18:40AM -0700, Kuniyuki Iwashima wrote:
> > > > >
> > > > > I expect this state of jobs with different network accounting config
> > > > > running concurrently is temporary while the migration from one to the other
> > > > > is happening. Please correct me if I am wrong.
> > > >
> > > > We need to migrate workload gradually and the system-wide config
> > > > does not work at all. AFAIU, there are already years of effort spent
> > > > on the migration but it's not yet completed at Google. So, I don't think
> > > > the need is temporary.
> > > >
> > >
> > > From what I remembered shared borg had completely moved to memcg
> > > accounting of network memory (with sys container as an exception) years
> > > ago. Did something change there?
> >
> > AFAICS, there are some workloads that opted out from memcg and
> > consumed too much tcp memory due to tcp_mem=UINT_MAX, triggering
> > OOM and disrupting other workloads.
> >
>
> What were the reasons behind opting out? We should fix those
> instead of a permanent opt-out option.
>
> > >
> > > > >
> > > > > My main concern with the memcg knob is that it is permanent and it
> > > > > requires a hierarchical semantics. No need to add a permanent interface
> > > > > for a temporary need and I don't see a clear hierarchical semantic for
> > > > > this interface.
> > > >
> > > > I don't see merits of having hierarchical semantics for this knob.
> > > > Regardless of this knob, hierarchical semantics is guaranteed
> > > > by other knobs. I think such semantics for this knob just complicates
> > > > the code with no gain.
> > > >
> > >
> > > Cgroup interfaces are hierarchical and we want to keep it that way.
> > > Putting non-hierarchical interfaces just makes configuration and setup
> > > hard to reason about.
> >
> > Actually, I tried that way in the initial draft version, but even if the
> > parent's knob is 1 and child one is 0, a harmful scenario didn't come
> > to my mind.
> >
>
> It is not just about harmful scenario but more about clear semantics.
> Check memory.zswap.writeback semantics.
zswap checks all parent cgroups when evaluating the knob, but
this is not an option for the networking fast path, as we cannot
check them for every skb without degrading performance.
Also, we don't track which sockets were created with the knob
enabled, nor how many such sockets are still left under the cgroup,
so there is no way to keep the option consistent throughout the
hierarchy, and no need to try hard to make the option pretend to be
consistent if there's no real issue.
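To make the cost argument concrete, here is a self-contained toy (all
names made up) of the ancestor walk that a zswap-style hierarchical
semantic would imply; the series avoids this by caching a single bit
per socket instead:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct cgroup_demo {
	struct cgroup_demo *parent;
	bool socket_isolated;
};

/* O(hierarchy depth) on every check: tolerable in a slow path,
 * too costly if it had to run for every skb. */
static bool isolated_hierarchical(const struct cgroup_demo *cg)
{
	for (; cg; cg = cg->parent)
		if (!cg->socket_isolated)
			return false;
	return true;
}

int main(void)
{
	struct cgroup_demo root = { .parent = NULL, .socket_isolated = true };
	struct cgroup_demo leaf = { .parent = &root, .socket_isolated = false };

	printf("leaf isolated: %d\n", isolated_hierarchical(&leaf));	/* prints 0 */
	return 0;
}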
>
> >
> > >
> > > >
> > > > >
> > > > > I am wondering if alternative approches for per-workload settings are
> > > > > explore starting with BPF.
> > > > >
> > >
> > > Any response on the above? Any alternative approaches explored?
> >
> > Do you mean flagging each socket by BPF at cgroup hook ?
>
> Not sure. Will it not be very similar to your current approach? Each
> socket is associated with a memcg and the at the place where you need to
> check which accounting method to use, just check that memcg setting in
> bpf and you can cache the result in socket as well.
The socket pointer is not writable by default, so we would need to add
a BPF helper or kfunc just for flipping a single bit. As said, this is
overkill, and a per-memcg knob is much simpler.
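For reference, a skeleton of the BPF route under discussion; the
cgroup/sock_create attach point is real, but the bit-flipping kfunc
named in the comment is hypothetical, i.e., exactly what would have to
be added:

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("cgroup/sock_create")
int mark_socket(struct bpf_sock *sk)
{
	/* A real implementation would need a new kfunc here, e.g. a
	 * hypothetical bpf_sk_set_memcg_isolated(sk), to set the bit
	 * that patch 12 stores in the lowest bit of sk->sk_memcg. */
	return 1;	/* allow the socket to be created */
}

char _license[] SEC("license") = "GPL";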
>
> >
> > I think it's overkill and we don't need such finer granularity.
> >
> > Also it sounds way too hacky to use BPF to correct the weird
> > behaviour from day0.
>
> What weird behavior? Two accounting mechanisms. Yes I agree but memcgs
> with different accounting mechanisms concurrently is also weird.
Not that weird, given that sockets in the root cgroup do not allocate
sk->sk_memcg and are subject to the global TCP memory accounting. We
already have a mixed set of memcgs.
Also, not every cgroup sets memory limits. systemd puts some
processes into a non-root cgroup by default without setting memory.max.
In such a case we definitely want the global memory accounting to take
place.
Having to set memory.max on every non-root cgroup is less flexible
and too restrictive.
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-22 21:59 ` Kuniyuki Iwashima
@ 2025-07-23 0:29 ` Shakeel Butt
2025-07-23 2:35 ` Kuniyuki Iwashima
0 siblings, 1 reply; 52+ messages in thread
From: Shakeel Butt @ 2025-07-23 0:29 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Tue, Jul 22, 2025 at 02:59:33PM -0700, Kuniyuki Iwashima wrote:
> On Tue, Jul 22, 2025 at 12:56 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Tue, Jul 22, 2025 at 12:03:48PM -0700, Kuniyuki Iwashima wrote:
> > > On Tue, Jul 22, 2025 at 11:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > >
> > > > On Tue, Jul 22, 2025 at 11:18:40AM -0700, Kuniyuki Iwashima wrote:
> > > > > >
> > > > > > I expect this state of jobs with different network accounting config
> > > > > > running concurrently is temporary while the migration from one to the other
> > > > > > is happening. Please correct me if I am wrong.
> > > > >
> > > > > We need to migrate workload gradually and the system-wide config
> > > > > does not work at all. AFAIU, there are already years of effort spent
> > > > > on the migration but it's not yet completed at Google. So, I don't think
> > > > > the need is temporary.
> > > > >
> > > >
> > > > From what I remembered shared borg had completely moved to memcg
> > > > accounting of network memory (with sys container as an exception) years
> > > > ago. Did something change there?
> > >
> > > AFAICS, there are some workloads that opted out from memcg and
> > > consumed too much tcp memory due to tcp_mem=UINT_MAX, triggering
> > > OOM and disrupting other workloads.
> > >
> >
> > What were the reasons behind opting out? We should fix those
> > instead of a permanent opt-out option.
> >
Any response to the above?
> > > >
> > > > > >
> > > > > > My main concern with the memcg knob is that it is permanent and it
> > > > > > requires a hierarchical semantics. No need to add a permanent interface
> > > > > > for a temporary need and I don't see a clear hierarchical semantic for
> > > > > > this interface.
> > > > >
> > > > > I don't see merits of having hierarchical semantics for this knob.
> > > > > Regardless of this knob, hierarchical semantics is guaranteed
> > > > > by other knobs. I think such semantics for this knob just complicates
> > > > > the code with no gain.
> > > > >
> > > >
> > > > Cgroup interfaces are hierarchical and we want to keep it that way.
> > > > Putting non-hierarchical interfaces just makes configuration and setup
> > > > hard to reason about.
> > >
> > > Actually, I tried that way in the initial draft version, but even if the
> > > parent's knob is 1 and child one is 0, a harmful scenario didn't come
> > > to my mind.
> > >
> >
> > It is not just about harmful scenario but more about clear semantics.
> > Check memory.zswap.writeback semantics.
>
> zswap checks all parent cgroups when evaluating the knob, but
> this is not an option for the networking fast path as we cannot
> check them for every skb, which will degrade the performance.
That's an implementation detail, and you can definitely optimize it.
One possible way might be caching the state in the socket at creation
time, which imposes some restrictions, e.g., to change the config, the
workload needs to be restarted.
>
> Also, we don't track which sockets were created with the knob
> enabled and how many such sockets are still left under the cgroup,
> there is no way to keep options consistent throughout the hierarchy
> and no need to try hard to make the option pretend to be consistent
> if there's no real issue.
>
>
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > I am wondering if alternative approches for per-workload settings are
> > > > > > explore starting with BPF.
> > > > > >
> > > >
> > > > Any response on the above? Any alternative approaches explored?
> > >
> > > Do you mean flagging each socket by BPF at cgroup hook ?
> >
> > Not sure. Will it not be very similar to your current approach? Each
> > socket is associated with a memcg, and at the place where you need to
> > check which accounting method to use, just check that memcg setting in
> > bpf, and you can cache the result in the socket as well.
>
> The socket pointer is not writable by default, thus we need to add
> a bpf helper or kfunc just for flipping a single bit. As said, this is
> overkill, and the per-memcg knob is much simpler.
>
Your simple solution is exposing a stable, permanent user-facing API
for what I suspect is a temporary situation. Let's discuss it at the end.
>
> >
> > >
> > > I think it's overkill and we don't need such fine granularity.
> > >
> > > Also it sounds way too hacky to use BPF to correct the weird
> > > behaviour from day0.
> >
> > What weird behavior? Two accounting mechanisms. Yes I agree but memcgs
> > with different accounting mechanisms concurrently is also weird.
>
> Not that weird given the root cgroup does not allocate sk->sk_memcg
> and is subject to the global tcp memory accounting. We already have
> a mixed set of memcgs.
Running workloads in root cgroup is not normal and comes with a warning
of no isolation provided.
I looked at the patch again to understand the modes you are introducing.
Initially, I thought the series introduced multiple modes, including an
option to exclude network memory from memcg accounting. However, if I
understand correctly, that is not the case—the opt-out applies only to
the global TCP/UDP accounting. That’s a relief, and I apologize for the
misunderstanding.
If I’m correct, you need a way to exclude a workload from the global
TCP/UDP accounting, and currently, memcg serves as a convenient
abstraction for the workload. Please let me know if I misunderstood.
Now memcg is one way to represent the workload. Another more natural, at
least to me, is the core cgroup. Basically cgroup.something interface.
BPF is yet another option.
To me cgroup seems preferable, but let's see what other memcg & cgroup
folks think. Also note that for cgroup and memcg the interface will need
to be hierarchical.
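For concreteness, the BPF route would be shaped roughly like the sketch
below. Note that a cgroup/sock_create program today can only tweak fields
such as sk->mark; tagging sk->sk_memcg would indeed need a new helper or
kfunc, as discussed above:
---8<---
// SPDX-License-Identifier: GPL-2.0
/* Rough sketch of the BPF option, not a working opt-out. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("cgroup/sock_create")
int sock_create_tag(struct bpf_sock *sk)
{
        /* Only fields like mark/priority are writable here today. */
        sk->mark = 0x1;         /* illustrative marker value only */
        return 1;               /* 1 == allow the socket to be created */
}

char _license[] SEC("license") = "GPL";
---8<---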
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-23 0:29 ` Shakeel Butt
@ 2025-07-23 2:35 ` Kuniyuki Iwashima
2025-07-23 17:28 ` Shakeel Butt
0 siblings, 1 reply; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-23 2:35 UTC (permalink / raw)
To: Shakeel Butt
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Tue, Jul 22, 2025 at 5:29 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Jul 22, 2025 at 02:59:33PM -0700, Kuniyuki Iwashima wrote:
> > On Tue, Jul 22, 2025 at 12:56 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Tue, Jul 22, 2025 at 12:03:48PM -0700, Kuniyuki Iwashima wrote:
> > > > On Tue, Jul 22, 2025 at 11:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > > >
> > > > > On Tue, Jul 22, 2025 at 11:18:40AM -0700, Kuniyuki Iwashima wrote:
> > > > > > >
> > > > > > > I expect this state of jobs with different network accounting config
> > > > > > > running concurrently is temporary while the migration from one to the
> > > > > > > is happening. Please correct me if I am wrong.
> > > > > >
> > > > > > We need to migrate workloads gradually, and the system-wide config
> > > > > > does not work at all. AFAIU, there are already years of effort spent
> > > > > > on the migration but it's not yet completed at Google. So, I don't think
> > > > > > the need is temporary.
> > > > > >
> > > > >
> > > > > From what I remember, shared borg had completely moved to memcg
> > > > > accounting of network memory (with sys container as an exception) years
> > > > > ago. Did something change there?
> > > >
> > > > AFAICS, there are some workloads that opted out from memcg and
> > > > consumed too much tcp memory due to tcp_mem=UINT_MAX, triggering
> > > > OOM and disrupting other workloads.
> > > >
> > >
> > > What were the reasons behind opting out? We should fix those
> > > instead of a permanent opt-out option.
> > >
>
> Any response to the above?
I'm just checking with internal folks, not sure if I will follow up on
this though, see below.
>
> > > > >
> > > > > > >
> > > > > > > My main concern with the memcg knob is that it is permanent and it
> > > > > > > requires a hierarchical semantics. No need to add a permanent interface
> > > > > > > for a temporary need and I don't see a clear hierarchical semantic for
> > > > > > > this interface.
> > > > > >
> > > > > > I don't see the merit of having hierarchical semantics for this knob.
> > > > > > Regardless of this knob, hierarchical semantics is guaranteed
> > > > > > by other knobs. I think such semantics for this knob just complicates
> > > > > > the code with no gain.
> > > > > >
> > > > >
> > > > > Cgroup interfaces are hierarchical and we want to keep it that way.
> > > > > Putting non-hierarchical interfaces just makes configuration and setup
> > > > > hard to reason about.
> > > >
> > > > Actually, I tried that way in the initial draft version, but even if the
> > > > parent's knob is 1 and child one is 0, a harmful scenario didn't come
> > > > to my mind.
> > > >
> > >
> > > It is not just about harmful scenario but more about clear semantics.
> > > Check memory.zswap.writeback semantics.
> >
> > zswap checks all parent cgroups when evaluating the knob, but
> > this is not an option for the networking fast path as we cannot
> > check them for every skb, which would degrade performance.
>
> That's an implementation detail and you can definitely optimize it. One
> possible way might be caching the state in the socket at creation time, which
> puts some restrictions in place, e.g. to change the config, the workload needs
> to be restarted.
>
> >
> > Also, since we don't track which sockets were created with the knob
> > enabled or how many such sockets are still left under the cgroup,
> > there is no way to keep the option consistent throughout the hierarchy,
> > and no need to try hard to make the option pretend to be consistent
> > if there's no real issue.
> >
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > I am wondering if alternative approaches for per-workload settings were
> > > > > > > explored, starting with BPF.
> > > > > > >
> > > > >
> > > > > Any response on the above? Any alternative approaches explored?
> > > >
> > > > Do you mean flagging each socket by BPF at a cgroup hook?
> > >
> > > Not sure. Will it not be very similar to your current approach? Each
> > > socket is associated with a memcg, and at the place where you need to
> > > check which accounting method to use, just check that memcg setting in
> > > bpf, and you can cache the result in the socket as well.
> >
> > The socket pointer is not writable by default, thus we need to add
> > a bpf helper or kfunc just for flipping a single bit. As said, this is
> > overkill, and the per-memcg knob is much simpler.
> >
>
> Your simple solution is exposing a stable, permanent user-facing API
> for what I suspect is a temporary situation. Let's discuss it at the end.
>
> >
> > >
> > > >
> > > > I think it's overkill and we don't need such fine granularity.
> > > >
> > > > Also it sounds way too hacky to use BPF to correct the weird
> > > > behaviour from day0.
> > >
> > > What weird behavior? Two accounting mechanisms. Yes I agree but memcgs
> > > with different accounting mechanisms concurrently is also weird.
> >
> > Not that weird given the root cgroup does not allocate sk->sk_memcg
> > and is subject to the global tcp memory accounting. We already have
> > a mixed set of memcgs.
>
> Running workloads in root cgroup is not normal and comes with a warning
> of no isolation provided.
>
> I looked at the patch again to understand the modes you are introducing.
> Initially, I thought the series introduced multiple modes, including an
> option to exclude network memory from memcg accounting. However, if I
> understand correctly, that is not the case—the opt-out applies only to
> the global TCP/UDP accounting. That’s a relief, and I apologize for the
> misunderstanding.
>
> If I’m correct, you need a way to exclude a workload from the global
> TCP/UDP accounting, and currently, memcg serves as a convenient
> abstraction for the workload. Please let me know if I misunderstood.
Correct.
Currently, memcg by itself cannot guarantee that memory allocation for
socket buffer does not fail even when memory.current < memory.max
due to the global protocol limits.
It means we need to increase the global limits to
(bytes of TCP socket buffer in each cgroup) * (number of cgroups),
which is hard to predict, and I guess that's the reason why you
or Wei set tcp_mem[] to UINT_MAX so that we can ignore the global
limit.
But we should keep tcp_mem[] within a sane range in the first place.
This series allows us to configure memcg limits only and let memcg
guarantee no failure until it fully consumes memory.max.
The point is that memcg should not be affected by the global limits,
and this is orthogonal to the assumption that every workload should
be running under memcg.
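Concretely, the charge-path decision the last patch makes can be
sketched as below (paraphrased, not the literal code;
mem_cgroup_sk_isolated() and sk_prot_mem_raise() are stand-in names):
---8<---
/*
 * An isolated socket is charged to its memcg only, so memory.max is
 * the sole limit and tcp_mem[]/udp_mem[] can never reject it; every
 * other socket keeps going through the global per-protocol counters.
 */
static bool sk_mem_raise_sketch(struct sock *sk, unsigned int nr_pages)
{
        if (mem_cgroup_sk_enabled(sk) && mem_cgroup_sk_isolated(sk))
                return mem_cgroup_sk_charge(sk, nr_pages, GFP_NOWAIT);

        return sk_prot_mem_raise(sk, nr_pages); /* global accounting */
}
---8<---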
>
> Now memcg is one way to represent the workload. Another more natural, at
> least to me, is the core cgroup. Basically cgroup.something interface.
> BPF is yet another option.
>
> To me cgroup seems preferable, but let's see what other memcg & cgroup
> folks think. Also note that for cgroup and memcg the interface will need
> to be hierarchical.
As the root cgroup doesn't have the knob, these combinations are
considered hierarchical:
(parent, child) = (0, 0), (0, 1), (1, 1)
and only the pattern below is not considered hierarchical
(parent, child) = (1, 0)
Let's say we lock the knob at the first socket creation like your
idea above.
If a parent's and its child's knobs are (0, 0) and the child creates a
socket, the child memcg is locked as 0. When the parent enables
the knob, we must check all child cgroups as well. Or do we lock
all the parents' knobs when a socket is created in a child cgroup
with knob=0? In any case we need a global lock.
Well, I understand that the hierarchical semantics is preferable
for cgroup but I think it does not resolve any real issue and rather
churns the code unnecessarily.
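For reference, write-time enforcement of the hierarchy might look like
the sketch below (hypothetical handler; the unresolved part is which
lock keeps the walk stable against sockets being created in children
concurrently):
---8<---
/*
 * Hypothetical: propagate the new value to every descendant so that
 * (parent, child) = (1, 0) can never be observed.  The walk itself is
 * cheap; serializing it against concurrent socket creation is not.
 */
static void socket_isolated_propagate_sketch(struct mem_cgroup *memcg, bool val)
{
        struct cgroup_subsys_state *pos;

        rcu_read_lock();
        css_for_each_descendant_pre(pos, &memcg->css)
                WRITE_ONCE(mem_cgroup_from_css(pos)->socket_isolated, val);
        rcu_read_unlock();
}
---8<---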
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-23 2:35 ` Kuniyuki Iwashima
@ 2025-07-23 17:28 ` Shakeel Butt
2025-07-23 18:06 ` Kuniyuki Iwashima
0 siblings, 1 reply; 52+ messages in thread
From: Shakeel Butt @ 2025-07-23 17:28 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Eric Dumazet, Michal Koutný, Tejun Heo, David S. Miller,
Jakub Kicinski, Neal Cardwell, Paolo Abeni, Willem de Bruijn,
Matthieu Baerts, Mat Martineau, Johannes Weiner, Michal Hocko,
Roman Gushchin, Andrew Morton, Simon Horman, Geliang Tang,
Muchun Song, Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
Cc Tejun & Michal to get their opinion on memcg vs cgroup vs BPF
options.
On Tue, Jul 22, 2025 at 07:35:52PM -0700, Kuniyuki Iwashima wrote:
[...]
> >
> > Running workloads in root cgroup is not normal and comes with a warning
> > of no isolation provided.
> >
> > I looked at the patch again to understand the modes you are introducing.
> > Initially, I thought the series introduced multiple modes, including an
> > option to exclude network memory from memcg accounting. However, if I
> > understand correctly, that is not the case—the opt-out applies only to
> > the global TCP/UDP accounting. That’s a relief, and I apologize for the
> > misunderstanding.
> >
> > If I’m correct, you need a way to exclude a workload from the global
> > TCP/UDP accounting, and currently, memcg serves as a convenient
> > abstraction for the workload. Please let me know if I misunderstood.
>
> Correct.
>
> Currently, memcg by itself cannot guarantee that memory allocation for
> socket buffer does not fail even when memory.current < memory.max
> due to the global protocol limits.
>
> It means we need to increase the global limits to
>
> (bytes of TCP socket buffer in each cgroup) * (number of cgroups),
>
> which is hard to predict, and I guess that's the reason why you
> or Wei set tcp_mem[] to UINT_MAX so that we can ignore the global
> limit.
No, that was not the reason. The main reason behind the max tcp_mem global
limit was that it was not needed, as memcg should account for and limit the
network memory. I think the reason you don't want the tcp_mem global limit
unlimited now is that you have an internal feature to let workloads opt out of
the memcg accounting of network memory, which is causing isolation
issues.
>
> But we should keep tcp_mem[] within a sane range in the first place.
>
> This series allows us to configure memcg limits only and let memcg
> guarantee no failure until it fully consumes memory.max.
>
> The point is that memcg should not be affected by the global limits,
> and this is orthogonal to the assumption that every workload should
> be running under memcg.
>
>
> >
> > Now memcg is one way to represent the workload. Another more natural, at
> > least to me, is the core cgroup. Basically cgroup.something interface.
> > BPF is yet another option.
> >
> > To me cgroup seems preferable, but let's see what other memcg & cgroup
> > folks think. Also note that for cgroup and memcg the interface will need
> > to be hierarchical.
>
> As the root cgroup doesn't have the knob, these combinations are
> considered hierarchical:
>
> (parent, child) = (0, 0), (0, 1), (1, 1)
>
> and only the pattern below is not considered hierarchical
>
> (parent, child) = (1, 0)
>
> Let's say we lock the knob at the first socket creation like your
> idea above.
>
> If a parent's and its child's knobs are (0, 0) and the child creates a
> socket, the child memcg is locked as 0. When the parent enables
> the knob, we must check all child cgroups as well. Or do we lock
> all the parents' knobs when a socket is created in a child cgroup
> with knob=0? In any case we need a global lock.
>
> Well, I understand that the hierarchical semantics is preferable
> for cgroup but I think it does not resolve any real issue and rather
> churns the code unnecessarily.
All this is implementation detail and I am asking about semantics. More
specifically:
1. Will the root be non-isolated always?
2. If a cgroup is isolated, does it mean all its descendants are
isolated?
3. Will there ever be a reasonable use-case where there is a non-isolated
sub-tree under an isolated ancestor?
Please give some thought to the above (and related) questions.
I am still not convinced that memcg is the right home for this opt-out
feature. I have CCed cgroup folks to get their opinion as well.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-23 17:28 ` Shakeel Butt
@ 2025-07-23 18:06 ` Kuniyuki Iwashima
2025-07-25 1:49 ` Jakub Kicinski
0 siblings, 1 reply; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-23 18:06 UTC (permalink / raw)
To: Shakeel Butt
Cc: Eric Dumazet, Michal Koutný, Tejun Heo, David S. Miller,
Jakub Kicinski, Neal Cardwell, Paolo Abeni, Willem de Bruijn,
Matthieu Baerts, Mat Martineau, Johannes Weiner, Michal Hocko,
Roman Gushchin, Andrew Morton, Simon Horman, Geliang Tang,
Muchun Song, Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
On Wed, Jul 23, 2025 at 10:28 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> Cc Tejun & Michal to get their opinion on memcg vs cgroup vs BPF
> options.
>
> On Tue, Jul 22, 2025 at 07:35:52PM -0700, Kuniyuki Iwashima wrote:
> [...]
> > >
> > > Running workloads in root cgroup is not normal and comes with a warning
> > > of no isolation provided.
> > >
> > > I looked at the patch again to understand the modes you are introducing.
> > > Initially, I thought the series introduced multiple modes, including an
> > > option to exclude network memory from memcg accounting. However, if I
> > > understand correctly, that is not the case—the opt-out applies only to
> > > the global TCP/UDP accounting. That’s a relief, and I apologize for the
> > > misunderstanding.
> > >
> > > If I’m correct, you need a way to exclude a workload from the global
> > > TCP/UDP accounting, and currently, memcg serves as a convenient
> > > abstraction for the workload. Please let me know if I misunderstood.
> >
> > Correct.
> >
> > Currently, memcg by itself cannot guarantee that memory allocation for
> > socket buffer does not fail even when memory.current < memory.max
> > due to the global protocol limits.
> >
> > It means we need to increase the global limits to
> >
> > (bytes of TCP socket buffer in each cgroup) * (number of cgroups),
> >
> > which is hard to predict, and I guess that's the reason why you
> > or Wei set tcp_mem[] to UINT_MAX so that we can ignore the global
> > limit.
>
> No, that was not the reason. The main reason behind the max tcp_mem global
> limit was that it was not needed
but the global limit did take effect, thus you had to set tcp_mem
to unlimited.
> as memcg should account for and limit the
> network memory.
> I think the reason you don't want the tcp_mem global limit
> unlimited now is that
memcg has been subject to the global limit from day 0.
And note that not every process is under memcg with memory.max
configured.
> you have an internal feature to let workloads opt out of
> the memcg accounting of network memory, which is causing isolation
> issues.
>
> >
> > But we should keep tcp_mem[] within a sane range in the first place.
> >
> > This series allows us to configure memcg limits only and let memcg
> > guarantee no failure until it fully consumes memory.max.
> >
> > The point is that memcg should not be affected by the global limits,
> > and this is orthogonal to the assumption that every workload should
> > be running under memcg.
> >
> >
> > >
> > > Now memcg is one way to represent the workload. Another more natural, at
> > > least to me, is the core cgroup. Basically cgroup.something interface.
> > > BPF is yet another option.
> > >
> > > To me cgroup seems preferable, but let's see what other memcg & cgroup
> > > folks think. Also note that for cgroup and memcg the interface will need
> > > to be hierarchical.
> >
> > As the root cgroup doesn't have the knob, these combinations are
> > considered hierarchical:
> >
> > (parent, child) = (0, 0), (0, 1), (1, 1)
> >
> > and only the pattern below is not considered hierarchical
> >
> > (parent, child) = (1, 0)
> >
> > Let's say we lock the knob at the first socket creation like your
> > idea above.
> >
> > If a parent's and its child's knobs are (0, 0) and the child creates a
> > socket, the child memcg is locked as 0. When the parent enables
> > the knob, we must check all child cgroups as well. Or do we lock
> > all the parents' knobs when a socket is created in a child cgroup
> > with knob=0? In any case we need a global lock.
> >
> > Well, I understand that the hierarchical semantics is preferable
> > for cgroup but I think it does not resolve any real issue and rather
> > churns the code unnecessarily.
>
> All this is implementation detail and I am asking about semantics. More
> specifically:
>
> 1. Will the root be non-isolated always?
Yes, because the root cgroup doesn't have memcg.
Also, the knob has CFTYPE_NOT_ON_ROOT.
> 2. If a cgroup is isolated, does it mean all its descendants are
> isolated?
No, but this is because we MUST think about how we handle
the scenario above where (parent, child) = (0, 0) becomes (1, 0).
We cannot think about the semantics without the implementation
details. And if we allow such a scenario, the hierarchical semantics
is fake and has no meaning.
> 3. Will there ever be a reasonable use-case where there is a non-isolated
> sub-tree under an isolated ancestor?
I think no, but again, we need to think about the scenario above,
otherwise, your ideal semantics is just broken.
Also, "no reasonable scenario" does not always mean "we must
prevent the scenario".
If there's nothing harmful, we can just let it be, especially if such a
restriction gives nothing and rather hurts performance with no
good reason.
>
> Please give some thought to the above (and related) questions.
Please think about the implementation details and whether the trade-off
(just keeping semantics vs. code churn & perf regression) really makes
sense.
>
> I am still not convinced that memcg is the right home for this opt-out
> feature. I have CCed cgroup folks to get their opinion as well.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-23 18:06 ` Kuniyuki Iwashima
@ 2025-07-25 1:49 ` Jakub Kicinski
2025-07-25 18:50 ` Kuniyuki Iwashima
0 siblings, 1 reply; 52+ messages in thread
From: Jakub Kicinski @ 2025-07-25 1:49 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Shakeel Butt, Eric Dumazet, Michal Koutný, Tejun Heo,
David S. Miller, Neal Cardwell, Paolo Abeni, Willem de Bruijn,
Matthieu Baerts, Mat Martineau, Johannes Weiner, Michal Hocko,
Roman Gushchin, Andrew Morton, Simon Horman, Geliang Tang,
Muchun Song, Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
On Wed, 23 Jul 2025 11:06:14 -0700 Kuniyuki Iwashima wrote:
> > 3. Will there ever be a reasonable use-case where there is a non-isolated
> > sub-tree under an isolated ancestor?
>
> I think no, but again, we need to think about the scenario above,
> otherwise, your ideal semantics is just broken.
>
> Also, "no reasonable scenario" does not always mean "we must
> prevent the scenario".
>
> If there's nothing harmful, we can just let it be, especially if such a
> restriction gives nothing and rather hurts performance with no
> good reason.
Stating the obvious perhaps but it's probably too late in the release
cycle to get enough agreement here to merge the series. So I'll mark
it as Deferred.
While I'm typing, TBH I'm not sure I'm following the arguments about
making the property hierarchical. Since the memory limit gets inherited
I don't understand why the property of being isolated would not.
Either I don't understand the memcg enough, or I don't understand your
intended semantics. Anyway..
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-25 1:49 ` Jakub Kicinski
@ 2025-07-25 18:50 ` Kuniyuki Iwashima
0 siblings, 0 replies; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-25 18:50 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Shakeel Butt, Eric Dumazet, Michal Koutný, Tejun Heo,
David S. Miller, Neal Cardwell, Paolo Abeni, Willem de Bruijn,
Matthieu Baerts, Mat Martineau, Johannes Weiner, Michal Hocko,
Roman Gushchin, Andrew Morton, Simon Horman, Geliang Tang,
Muchun Song, Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
On Thu, Jul 24, 2025 at 6:49 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 23 Jul 2025 11:06:14 -0700 Kuniyuki Iwashima wrote:
> > > 3. Will there ever be a reasonable use-case where there is a non-isolated
> > > sub-tree under an isolated ancestor?
> >
> > I think no, but again, we need to think about the scenario above,
> > otherwise, your ideal semantics is just broken.
> >
> > Also, "no reasonable scenario" does not always mean "we must
> > prevent the scenario".
> >
> > If there's nothing harmful, we can just let it be, especially if such a
> > restriction gives nothing and rather hurts performance with no
> > good reason.
>
> Stating the obvious perhaps but it's probably too late in the release
> cycle to get enough agreement here to merge the series. So I'll mark
> it as Deferred.
Fair enough.
>
> While I'm typing, TBH I'm not sure I'm following the arguments about
> making the property hierarchical. Since the memory limit gets inherited
> I don't understand why the property of being isolated would not.
> Either I don't understand the memcg enough, or I don't understand your
> intended semantics. Anyway..
Inheriting a config is easy, but keeping the hierarchy complete isn't,
or maybe I'm thinking too hard :S
[root@fedora ~]# mkdir /sys/fs/cgroup/test1
[root@fedora ~]# mkdir /sys/fs/cgroup/test1/test2
[root@fedora ~]# echo +memory > /sys/fs/cgroup/test1/cgroup.subtree_control
[root@fedora ~]# echo 10000 > /sys/fs/cgroup/test1/test2/memory.max
[root@fedora ~]# echo 1000 > /sys/fs/cgroup/test1/memory.max
[ 108.130895] bash invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL),
order=0, oom_score_adj=0
...
[ 108.260164] Out of memory and no killable processes...
[root@fedora ~]# cat /sys/fs/cgroup/test1/test2/memory.max
8192
[root@fedora ~]# cat /sys/fs/cgroup/test1/memory.max
0
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-21 20:35 ` [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
2025-07-22 15:14 ` Shakeel Butt
@ 2025-07-28 16:07 ` Johannes Weiner
2025-07-28 21:41 ` Kuniyuki Iwashima
2025-07-31 2:58 ` Roman Gushchin
2025-07-31 13:38 ` Michal Koutný
3 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2025-07-28 16:07 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote:
> Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> buffers and charge memory to per-protocol global counters pointed to by
> sk->sk_proto->memory_allocated.
>
> When running under a non-root cgroup, this memory is also charged to the
> memcg as sock in memory.stat.
>
> Even when memory usage is controlled by memcg, sockets using such protocols
> are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
>
> This makes it difficult to accurately estimate and configure appropriate
> global limits, especially in multi-tenant environments.
>
> If all workloads were guaranteed to be controlled under memcg, the issue
> could be worked around by setting tcp_mem[0~2] to UINT_MAX.
>
> In reality, this assumption does not always hold, and a single workload
> that opts out of memcg can consume memory up to the global limit,
> becoming a noisy neighbour.
Yes, an uncontrolled cgroup can consume all of a shared resource and
thereby become a noisy neighbor. Why is network memory special?
I assume you have some other mechanisms for curbing things like
filesystem caches, anon memory, swap etc. of such otherwise
uncontrolled groups, and this just happens to be your missing piece.
But at this point, you're operating so far out of the cgroup resource
management model that I don't think it can be reasonably supported.
I hate to say this, but can't you carry this out of tree until the
transition is complete?
I just don't think it makes any sense to have this as a permanent
fixture in a general-purpose container management interface.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-28 16:07 ` Johannes Weiner
@ 2025-07-28 21:41 ` Kuniyuki Iwashima
2025-07-29 14:22 ` Johannes Weiner
0 siblings, 1 reply; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-28 21:41 UTC (permalink / raw)
To: Johannes Weiner
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Mon, Jul 28, 2025 at 9:07 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote:
> > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > buffers and charge memory to per-protocol global counters pointed to by
> > sk->sk_proto->memory_allocated.
> >
> > When running under a non-root cgroup, this memory is also charged to the
> > memcg as sock in memory.stat.
> >
> > Even when memory usage is controlled by memcg, sockets using such protocols
> > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> >
> > This makes it difficult to accurately estimate and configure appropriate
> > global limits, especially in multi-tenant environments.
> >
> > If all workloads were guaranteed to be controlled under memcg, the issue
> > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> >
> > In reality, this assumption does not always hold, and a single workload
> > that opts out of memcg can consume memory up to the global limit,
> > becoming a noisy neighbour.
>
> Yes, an uncontrolled cgroup can consume all of a shared resource and
> thereby become a noisy neighbor. Why is network memory special?
>
> I assume you have some other mechanisms for curbing things like
> filesystem caches, anon memory, swap etc. of such otherwise
> uncontrolled groups, and this just happens to be your missing piece.
I think that's the tcp_mem[] knob, limiting tcp mem globally for
the "uncontrolled" cgroup. But we can't use it because the
"controlled" cgroup is also limited by this knob.
If we want to properly control the "controlled" cgroup by its own limit
only, we must disable the global limit completely on the host,
meaning we lose the "missing piece".
Currently, there are only two poor choices:
1) Use tcp_mem[], but memory allocation could fail even if the
cgroup has available memory
2) Disable tcp_mem[], but uncontrolled cgroups lose the seatbelt and
can consume memory up to the system limit
but what we really need is
3) Uncontrolled cgroups are limited by tcp_mem[],
AND
for controlled cgroups, memory allocation won't fail if
they have available memory, regardless of tcp_mem[]
>
> But at this point, you're operating so far out of the cgroup resource
> management model that I don't think it can be reasonably supported.
I think it's rather operating under the normal cgroup management
model, relying on the configured memory limit for each cgroup.
What's wrong here is that we had to set tcp_mem[] to UINT_MAX and
get rid of the seatbelt for uncontrolled cgroups for that management
model.
But this is just because cgroup memory is also charged globally
to TCP, which it should not be.
>
> I hate to say this, but can't you carry this out of tree until the
> transition is complete?
>
> I just don't think it makes any sense to have this as a permanent
> fixture in a general-purpose container management interface.
I understand that, and we should eventually fix "1) or 2)" to
just 3), but introducing this change without a knob will break
assumptions in userspace and trigger regressions.
cgroup v2 is now widely enabled by major distros, and systemd
creates many processes under non-root cgroups but without
memory limits.
If we had no knob, such processes would suddenly lose the
tcp_mem[] seatbelt and could consume memory up to the system
limit.
How about adding a deprecation plan for the knob via pr_warn_once()
or something, and letting users configure the limits properly by
then?
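A sketch of that deprecation idea (handler shape and message are
illustrative only):
---8<---
/* Illustrative only: warn once on the first write to the knob so
 * users know it is transitional and can fix their tcp_mem[] setup. */
static ssize_t memory_socket_isolated_write_sketch(struct kernfs_open_file *of,
                                                   char *buf, size_t nbytes,
                                                   loff_t off)
{
        pr_warn_once("memory.socket_isolated is transitional and will be removed in a future release\n");

        /* ... parse buf and store the value as the real handler would ... */
        return nbytes;
}
---8<---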
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-28 21:41 ` Kuniyuki Iwashima
@ 2025-07-29 14:22 ` Johannes Weiner
2025-07-29 19:41 ` Kuniyuki Iwashima
0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2025-07-29 14:22 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Mon, Jul 28, 2025 at 02:41:38PM -0700, Kuniyuki Iwashima wrote:
> On Mon, Jul 28, 2025 at 9:07 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote:
> > > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > > buffers and charge memory to per-protocol global counters pointed to by
> > > sk->sk_proto->memory_allocated.
> > >
> > > When running under a non-root cgroup, this memory is also charged to the
> > > memcg as sock in memory.stat.
> > >
> > > Even when memory usage is controlled by memcg, sockets using such protocols
> > > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> > >
> > > This makes it difficult to accurately estimate and configure appropriate
> > > global limits, especially in multi-tenant environments.
> > >
> > > If all workloads were guaranteed to be controlled under memcg, the issue
> > > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> > >
> > > In reality, this assumption does not always hold, and a single workload
> > > that opts out of memcg can consume memory up to the global limit,
> > > becoming a noisy neighbour.
> >
> > Yes, an uncontrolled cgroup can consume all of a shared resource and
> > thereby become a noisy neighbor. Why is network memory special?
> >
> > I assume you have some other mechanisms for curbing things like
> > filesystem caches, anon memory, swap etc. of such otherwise
> > uncontrolled groups, and this just happens to be your missing piece.
>
> I think that's the tcp_mem[] knob, limiting tcp mem globally for
> the "uncontrolled" cgroup. But we can't use it because the
> "controlled" cgroup is also limited by this knob.
No, I was really asking what you do about other types of memory
consumed by such uncontrolled cgroups.
You can't have uncontrolled groups and complain about their resource
consumption.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-29 14:22 ` Johannes Weiner
@ 2025-07-29 19:41 ` Kuniyuki Iwashima
0 siblings, 0 replies; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-29 19:41 UTC (permalink / raw)
To: Johannes Weiner
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
On Tue, Jul 29, 2025 at 7:22 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Jul 28, 2025 at 02:41:38PM -0700, Kuniyuki Iwashima wrote:
> > On Mon, Jul 28, 2025 at 9:07 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote:
> > > > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > > > buffers and charge memory to per-protocol global counters pointed to by
> > > > sk->sk_proto->memory_allocated.
> > > >
> > > > When running under a non-root cgroup, this memory is also charged to the
> > > > memcg as sock in memory.stat.
> > > >
> > > > Even when memory usage is controlled by memcg, sockets using such protocols
> > > > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> > > >
> > > > This makes it difficult to accurately estimate and configure appropriate
> > > > global limits, especially in multi-tenant environments.
> > > >
> > > > If all workloads were guaranteed to be controlled under memcg, the issue
> > > > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> > > >
> > > > In reality, this assumption does not always hold, and a single workload
> > > > that opts out of memcg can consume memory up to the global limit,
> > > > becoming a noisy neighbour.
> > >
> > > Yes, an uncontrolled cgroup can consume all of a shared resource and
> > > thereby become a noisy neighbor. Why is network memory special?
> > >
> > > I assume you have some other mechanisms for curbing things like
> > > filesystem caches, anon memory, swap etc. of such otherwise
> > > uncontrolled groups, and this just happens to be your missing piece.
> >
> > I think that's the tcp_mem[] knob, limiting tcp mem globally for
> > the "uncontrolled" cgroup. But we can't use it because the
> > "controlled" cgroup is also limited by this knob.
>
> No, I was really asking what you do about other types of memory
> consumed by such uncontrolled cgroups.
>
> You can't have uncontrolled groups and complain about their resource
> consumption.
Only 10% of physical memory is allowed to be used globally for TCP.
How is it supposed to work if we don't enforce limits on uncontrolled
cgroups?
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-21 20:35 ` [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
2025-07-22 15:14 ` Shakeel Butt
2025-07-28 16:07 ` Johannes Weiner
@ 2025-07-31 2:58 ` Roman Gushchin
2025-07-31 13:38 ` Michal Koutný
3 siblings, 0 replies; 52+ messages in thread
From: Roman Gushchin @ 2025-07-31 2:58 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Shakeel Butt, Andrew Morton,
Simon Horman, Geliang Tang, Muchun Song, Kuniyuki Iwashima,
netdev, mptcp, cgroups, linux-mm
Kuniyuki Iwashima <kuniyu@google.com> writes:
> Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> buffers and charge memory to per-protocol global counters pointed to by
> sk->sk_proto->memory_allocated.
>
> When running under a non-root cgroup, this memory is also charged to the
> memcg as sock in memory.stat.
>
> Even when memory usage is controlled by memcg, sockets using such protocols
> are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
>
> This makes it difficult to accurately estimate and configure appropriate
> global limits, especially in multi-tenant environments.
>
> If all workloads were guaranteed to be controlled under memcg, the issue
> could be worked around by setting tcp_mem[0~2] to UINT_MAX.
>
> In reality, this assumption does not always hold, and a single workload
> that opts out of memcg can consume memory up to the global limit,
> becoming a noisy neighbour.
>
> Let's decouple memcg from the global per-protocol memory accounting.
>
> This simplifies memcg configuration while keeping the global limits
> within a reasonable range.
I don't think it should be a memcg feature. In fact, it doesn't have
much to do with cgroups at all (it's not hierarchical, it doesn't
control the resource allocation, and in the end it controls an
alternative to the memory cgroup memory accounting system).
Instead, it can be a per-process prctl option.
(Assuming the feature is really needed - I'm also curious why some
processes have to be excluded from the memcg accounting - it sounds like
generally a bad idea).
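A hypothetical userspace sketch of that prctl() shape
(PR_SET_SOCK_ISOLATED and its number are made up; no such option exists
today, so the call fails with EINVAL on a real kernel):
---8<---
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_SOCK_ISOLATED
#define PR_SET_SOCK_ISOLATED 0x53495354 /* fictitious request number */
#endif

int main(void)
{
        /* Sockets created by this process afterwards would skip the
         * global tcp_mem[]/udp_mem[] accounting under this idea. */
        if (prctl(PR_SET_SOCK_ISOLATED, 1, 0, 0, 0) == -1)
                perror("prctl(PR_SET_SOCK_ISOLATED)");
        return 0;
}
---8<---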
Thanks
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-21 20:35 ` [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
` (2 preceding siblings ...)
2025-07-31 2:58 ` Roman Gushchin
@ 2025-07-31 13:38 ` Michal Koutný
2025-07-31 23:51 ` Kuniyuki Iwashima
3 siblings, 1 reply; 52+ messages in thread
From: Michal Koutný @ 2025-07-31 13:38 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton, Simon Horman, Geliang Tang, Muchun Song,
Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima <kuniyu@google.com> wrote:
> Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> buffers and charge memory to per-protocol global counters pointed to by
> sk->sk_proto->memory_allocated.
>
> When running under a non-root cgroup, this memory is also charged to the
> memcg as sock in memory.stat.
>
> Even when memory usage is controlled by memcg, sockets using such protocols
> are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
IIUC the envisioned use case is that some cgroups feed from the global
resource and some from their own limit.
It means the admin knows both:
a) how to configure an individual cgroup,
b) how to configure the global limit (for the rest).
So why can't they stick to a single model only?
> This makes it difficult to accurately estimate and configure appropriate
> global limits, especially in multi-tenant environments.
>
> If all workloads were guaranteed to be controlled under memcg, the issue
> could be worked around by setting tcp_mem[0~2] to UINT_MAX.
>
> In reality, this assumption does not always hold, and a single workload
> that opts out of memcg can consume memory up to the global limit,
> becoming a noisy neighbour.
It doesn't seem like a good idea to remove limits from possibly noisy
units.
> Let's decouple memcg from the global per-protocol memory accounting.
>
> This simplifies memcg configuration while keeping the global limits
> within a reasonable range.
I think this is a configuration issue only, i.e. instead of preserving
the global limit because of _some_ memcgs, the configuration management
could have a default memcg limit that is applied to those memcgs so
that there's no risk of runaways even in the absence of a global limit.
Regards,
Michal
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v1 net-next 11/13] net-memcg: Add memory.socket_isolated knob.
2025-07-21 20:35 ` [PATCH v1 net-next 11/13] net-memcg: Add memory.socket_isolated knob Kuniyuki Iwashima
2025-07-22 15:00 ` Eric Dumazet
@ 2025-07-31 13:39 ` Michal Koutný
1 sibling, 0 replies; 52+ messages in thread
From: Michal Koutný @ 2025-07-31 13:39 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton, Simon Horman, Geliang Tang, Muchun Song,
Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
Hello Kuniyuki.
On Mon, Jul 21, 2025 at 08:35:30PM +0000, Kuniyuki Iwashima <kuniyu@google.com> wrote:
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1878,6 +1878,22 @@ The following nested keys are defined.
> Shows pressure stall information for memory. See
> :ref:`Documentation/accounting/psi.rst <psi>` for details.
>
> + memory.socket_isolated
> + A read-write single value file which exists on non-root cgroups.
> + The default value is "0".
Such attributes don't fit well into the hierarchy.
What are the expectations in non-root, non-leaf cgroups?
Also, the global limit is not so different from a memcg limit
configured on ancestors. This provision thus looks like it handles only
one particular case.
> +
> + Some networking protocols (e.g., TCP, UDP) implement their own memory
> + accounting for socket buffers.
> +
> + This memory is also charged to a non-root cgroup as sock in memory.stat.
> +
> + Since per-protocol limits such as /proc/sys/net/ipv4/tcp_mem and
> + /proc/sys/net/ipv4/udp_mem are global, memory allocation for socket
> + buffers may fail even when the cgroup has available memory.
> +
> + Sockets created with socket_isolated set to 1 are no longer subject
> + to these global protocol limits.
What happens when it's changed during the lifetime of a cgroup?
Thanks,
Michal
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-31 13:38 ` Michal Koutný
@ 2025-07-31 23:51 ` Kuniyuki Iwashima
2025-08-01 7:00 ` Michal Koutný
0 siblings, 1 reply; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-31 23:51 UTC (permalink / raw)
To: Michal Koutný
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton, Simon Horman, Geliang Tang, Muchun Song,
Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
On Thu, Jul 31, 2025 at 6:39 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima <kuniyu@google.com> wrote:
> > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > buffers and charge memory to per-protocol global counters pointed to by
> > sk->sk_proto->memory_allocated.
> >
> > When running under a non-root cgroup, this memory is also charged to the
> > memcg as sock in memory.stat.
> >
> > Even when memory usage is controlled by memcg, sockets using such protocols
> > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
>
> IIUC the envisioned use case is that some cgroups feed from global
> resource and some from their own limit.
> It means the admin knows both:
> a) how to configure individual cgroup,
> b) how to configure global limit (for the rest).
> So why cannot they stick to a single model only?
>
> > This makes it difficult to accurately estimate and configure appropriate
> > global limits, especially in multi-tenant environments.
> >
> > If all workloads were guaranteed to be controlled under memcg, the issue
> > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> >
> > In reality, this assumption does not always hold, and a single workload
> > that opts out of memcg can consume memory up to the global limit,
> > becoming a noisy neighbour.
>
> It doesn't seem like a good idea to remove limits from possibly noisy
> units.
>
> > Let's decouple memcg from the global per-protocol memory accounting.
> >
> > This simplifies memcg configuration while keeping the global limits
> > within a reasonable range.
>
> I think this is a configuration issue only, i.e. instead of preserving
> the global limit because of _some_ memcgs, the configuration management
> could have a default memcg limit that is applied to those memcgs so
> that there's no risk of runaways even in the absence of a global limit.
Doesn't that end up implementing another tcp_mem[], which now
enforces limits on uncontrolled cgroups (memory.max == max)?
Or will it simply end up with the system-wide OOM killer?
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-07-31 23:51 ` Kuniyuki Iwashima
@ 2025-08-01 7:00 ` Michal Koutný
2025-08-01 16:27 ` Kuniyuki Iwashima
0 siblings, 1 reply; 52+ messages in thread
From: Michal Koutný @ 2025-08-01 7:00 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton, Simon Horman, Geliang Tang, Muchun Song,
Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
On Thu, Jul 31, 2025 at 04:51:43PM -0700, Kuniyuki Iwashima <kuniyu@google.com> wrote:
> Doesn't that end up implementing another tcp_mem[], which now
> enforces limits on uncontrolled cgroups (memory.max == max)?
> Or will it simply end up with the system-wide OOM killer?
I meant to rely on the existing mem_cgroup_charge_skmem(), i.e.
there'd always be memory.max < max (ensured by the configuring agent).
But you're right, the OOM _may_ be global if the limit is too loose.
Actually, as I think about it, another configuration option would be to
reorganize the memcg tree and put all non-isolated memcgs under one
ancestor and set its memory.max limit (so that it's shared among them
like the global limit).
HTH,
Michal
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting.
2025-08-01 7:00 ` Michal Koutný
@ 2025-08-01 16:27 ` Kuniyuki Iwashima
0 siblings, 0 replies; 52+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-01 16:27 UTC (permalink / raw)
To: Michal Koutný
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton, Simon Horman, Geliang Tang, Muchun Song,
Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm
On Fri, Aug 1, 2025 at 12:00 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Thu, Jul 31, 2025 at 04:51:43PM -0700, Kuniyuki Iwashima <kuniyu@google.com> wrote:
> > Doesn't that end up implementing another tcp_mem[], which now
> > enforces limits on uncontrolled cgroups (memory.max == max)?
> > Or will it simply end up with the system-wide OOM killer?
>
> I meant to rely on the existing mem_cgroup_charge_skmem(), i.e.
> there'd always be memory.max < max (ensured by the configuring agent).
> But you're right, the OOM _may_ be global if the limit is too loose.
>
> Actually, as I think about it, another configuration option would be to
> reorganize the memcg tree and put all non-isolated memcgs under one
> ancestor and set its memory.max limit (so that it's shared among them
> like the global limit).
Interesting. Is it still possible if other controllers are configured
differently and form a hierarchy? It sounds cgroup-v1-ish.
Or preparing an independent fake memcg for non-isolated sockets
and tying it to sk->sk_memcg could be an option?
The drawback of that option is that sockets are not charged to each
memcg, so we cannot monitor the usage via memory.stat:sock,
which makes it a bit difficult to configure memory.max based on it.
Another idea that I have is to get rid of the knob and allow
decoupling memcg from TCP mem accounting only for controlled
cgroups.
This makes it possible to configure memcg by memory.max only,
but does not change anything for uncontrolled cgroups from the
current situation.
---8<---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 85decc4319f9..6d7084a32b12 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5102,7 +5102,8 @@ static void mem_cgroup_sk_set(struct sock *sk, const struct mem_cgroup *memcg)
 {
        unsigned long val = (unsigned long)memcg;
 
-       val |= READ_ONCE(memcg->socket_isolated);
+       if (memcg->memory.max != PAGE_COUNTER_MAX)
+               val |= MEMCG_SOCK_ISOLATED;
 
        sk->sk_memcg = (struct mem_cgroup *)val;
 }
---8<---
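For readers unfamiliar with the trick this relies on: struct mem_cgroup
is at least word-aligned, so bit 0 of the pointer is free to carry the
flag. A standalone userspace illustration (compiles as-is; all names
are stand-ins):
---8<---
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SOCK_ISOLATED 1UL               /* stands in for MEMCG_SOCK_ISOLATED */

struct obj { long payload; };           /* stands in for struct mem_cgroup */

int main(void)
{
        static struct obj memcg = { .payload = 42 };
        uintptr_t val = (uintptr_t)&memcg;

        assert(!(val & SOCK_ISOLATED)); /* alignment keeps bit 0 clear */
        val |= SOCK_ISOLATED;           /* flag and pointer share one word */

        /* Mask the flag off before dereferencing, as a
         * mem_cgroup_from_sk()-style accessor would do. */
        struct obj *p = (struct obj *)(val & ~SOCK_ISOLATED);

        printf("isolated=%lu payload=%ld\n",
               (unsigned long)(val & SOCK_ISOLATED), p->payload);
        return 0;
}
---8<---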
^ permalink raw reply related [flat|nested] 52+ messages in thread
end of thread
Thread overview: 52+ messages
2025-07-21 20:35 [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
2025-07-21 20:35 ` [PATCH v1 net-next 01/13] mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n Kuniyuki Iwashima
2025-07-22 14:30 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 02/13] mptcp: Use tcp_under_memory_pressure() in mptcp_epollin_ready() Kuniyuki Iwashima
2025-07-22 14:33 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 03/13] tcp: Simplify error path in inet_csk_accept() Kuniyuki Iwashima
2025-07-22 14:34 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 04/13] net: Call trace_sock_exceed_buf_limit() for memcg failure with SK_MEM_RECV Kuniyuki Iwashima
2025-07-22 14:37 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 05/13] net: Clean up __sk_mem_raise_allocated() Kuniyuki Iwashima
2025-07-22 14:38 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 06/13] net-memcg: Introduce mem_cgroup_from_sk() Kuniyuki Iwashima
2025-07-22 14:39 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 07/13] net-memcg: Introduce mem_cgroup_sk_enabled() Kuniyuki Iwashima
2025-07-22 14:40 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 08/13] net-memcg: Pass struct sock to mem_cgroup_sk_(un)?charge() Kuniyuki Iwashima
2025-07-22 14:56 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 09/13] net-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure() Kuniyuki Iwashima
2025-07-22 14:58 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 10/13] net: Define sk_memcg under CONFIG_MEMCG Kuniyuki Iwashima
2025-07-22 14:58 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 11/13] net-memcg: Add memory.socket_isolated knob Kuniyuki Iwashima
2025-07-22 15:00 ` Eric Dumazet
2025-07-31 13:39 ` Michal Koutný
2025-07-21 20:35 ` [PATCH v1 net-next 12/13] net-memcg: Store memcg->socket_isolated in sk->sk_memcg Kuniyuki Iwashima
2025-07-22 15:02 ` Eric Dumazet
2025-07-21 20:35 ` [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
2025-07-22 15:14 ` Shakeel Butt
2025-07-22 15:24 ` Eric Dumazet
2025-07-22 15:52 ` Shakeel Butt
2025-07-22 18:18 ` Kuniyuki Iwashima
2025-07-22 18:47 ` Shakeel Butt
2025-07-22 19:03 ` Kuniyuki Iwashima
2025-07-22 19:56 ` Shakeel Butt
2025-07-22 21:59 ` Kuniyuki Iwashima
2025-07-23 0:29 ` Shakeel Butt
2025-07-23 2:35 ` Kuniyuki Iwashima
2025-07-23 17:28 ` Shakeel Butt
2025-07-23 18:06 ` Kuniyuki Iwashima
2025-07-25 1:49 ` Jakub Kicinski
2025-07-25 18:50 ` Kuniyuki Iwashima
2025-07-28 16:07 ` Johannes Weiner
2025-07-28 21:41 ` Kuniyuki Iwashima
2025-07-29 14:22 ` Johannes Weiner
2025-07-29 19:41 ` Kuniyuki Iwashima
2025-07-31 2:58 ` Roman Gushchin
2025-07-31 13:38 ` Michal Koutný
2025-07-31 23:51 ` Kuniyuki Iwashima
2025-08-01 7:00 ` Michal Koutný
2025-08-01 16:27 ` Kuniyuki Iwashima
2025-07-22 15:04 ` [PATCH v1 net-next 00/13] net-memcg: Allow decoupling memcg from sk->sk_prot->memory_allocated Shakeel Butt
2025-07-22 15:34 ` Eric Dumazet