* [PATCH v3 net-next 00/12] net-memcg: Decouple controlled memcg from sk->sk_prot->memory_allocated.
@ 2025-08-12 17:58 Kuniyuki Iwashima
  2025-08-12 17:58 ` [PATCH v3 net-next 01/12] mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n Kuniyuki Iwashima
                   ` (11 more replies)
  0 siblings, 12 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-12 17:58 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Andrew Morton, Michal Koutný, Tejun Heo
  Cc: Simon Horman, Geliang Tang, Muchun Song, Mina Almasry,
	Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, mptcp, cgroups,
	linux-mm

Some protocols (e.g., TCP, UDP) have their own memory accounting for
socket buffers and charge memory to global per-protocol counters such
as /proc/sys/net/ipv4/tcp_mem.

When running under a non-root cgroup, this memory is also charged to
the memcg as "sock" in memory.stat.

Sockets of such protocols are still subject to the global limits and
are thus affected by noisy neighbours outside the cgroup.

This makes it difficult to accurately estimate and configure appropriate
global limits.

If all workloads were guaranteed to be controlled under memcg, the issue
could be worked around by setting tcp_mem[0~2] to UINT_MAX.
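
For example, the following would make the global limits practically
unreachable (the values are in pages):

  # sysctl -w net.ipv4.tcp_mem="4294967295 4294967295 4294967295"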

However, this assumption does not always hold, and processes that belong
to the root cgroup or opt out of memcg can consume memory up to the global
limit, which is problematic.

This series decouples a memcg from the global memory accounting if its
memory.max is not "max".  This simplifies the memcg configuration while
keeping the global limits within a reasonable range; for TCP, that is
only about 10% of the physical memory by default.
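
For example, on a host with 64 GiB of memory, the default TCP limit
amounts to roughly 6.4 GiB, shared by every TCP socket on the host
regardless of cgroup.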

Overview of the series:

  patch 1 is a bug fix for MPTCP
  patches 2 ~ 9 move sk->sk_memcg accesses to a single place
  patch 10 moves sk_memcg under CONFIG_MEMCG
  patch 11 stores a flag in the lowest bit of sk->sk_memcg
  patch 12 decouples memcg from sk_prot->memory_allocated based on the flag


Changes:
  v3:
    * Patch 12
      * Fix build failure for kTLS (include <net/proto_memory.h>)

  v2: https://lore.kernel.org/netdev/20250811173116.2829786-1-kuniyu@google.com/
    * Remove per-memcg knob
    * Patch 11
      * Set flag on sk_memcg based on memory.max
    * Patch 12
      * Add sk_should_enter_memory_pressure() and cover
        tcp_enter_memory_pressure() calls
      * Update examples in changelog

  v1: https://lore.kernel.org/netdev/20250721203624.3807041-1-kuniyu@google.com/


Kuniyuki Iwashima (12):
  mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n.
  mptcp: Use tcp_under_memory_pressure() in mptcp_epollin_ready().
  tcp: Simplify error path in inet_csk_accept().
  net: Call trace_sock_exceed_buf_limit() for memcg failure with
    SK_MEM_RECV.
  net: Clean up __sk_mem_raise_allocated().
  net-memcg: Introduce mem_cgroup_from_sk().
  net-memcg: Introduce mem_cgroup_sk_enabled().
  net-memcg: Pass struct sock to mem_cgroup_sk_(un)?charge().
  net-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure().
  net: Define sk_memcg under CONFIG_MEMCG.
  net-memcg: Store MEMCG_SOCK_ISOLATED in sk->sk_memcg.
  net-memcg: Decouple controlled memcg from global protocol memory
    accounting.

 include/linux/memcontrol.h      | 45 +++++++++-------
 include/net/proto_memory.h      | 15 ++++--
 include/net/sock.h              | 67 +++++++++++++++++++++++
 include/net/tcp.h               | 10 ++--
 mm/memcontrol.c                 | 48 +++++++++++++----
 net/core/sock.c                 | 94 +++++++++++++++++++++------------
 net/ipv4/inet_connection_sock.c | 35 +++++++-----
 net/ipv4/tcp.c                  |  3 +-
 net/ipv4/tcp_output.c           | 13 +++--
 net/mptcp/protocol.c            |  4 +-
 net/mptcp/protocol.h            |  4 +-
 net/mptcp/subflow.c             | 11 ++--
 net/tls/tls_device.c            |  4 +-
 13 files changed, 254 insertions(+), 99 deletions(-)

-- 
2.51.0.rc0.205.g4a044479a3-goog



* [PATCH v3 net-next 01/12] mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n.
  2025-08-12 17:58 [PATCH v3 net-next 00/12] net-memcg: Decouple controlled memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
@ 2025-08-12 17:58 ` Kuniyuki Iwashima
  2025-08-13  8:54   ` Matthieu Baerts
  2025-08-14 12:30   ` Michal Koutný
  2025-08-12 17:58 ` [PATCH v3 net-next 02/12] mptcp: Use tcp_under_memory_pressure() in mptcp_epollin_ready() Kuniyuki Iwashima
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-12 17:58 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Andrew Morton, Michal Koutný, Tejun Heo
  Cc: Simon Horman, Geliang Tang, Muchun Song, Mina Almasry,
	Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, mptcp, cgroups,
	linux-mm

When sk_alloc() allocates a socket, mem_cgroup_sk_alloc() sets
sk->sk_memcg based on the current task.

MPTCP subflow socket creation is triggered from userspace or
an in-kernel worker.

In the latter case, sk->sk_memcg is not what we want.  So, we fix
it up from the parent socket's sk->sk_memcg in mptcp_attach_cgroup().

Although the fixup code is guarded by #ifdef CONFIG_MEMCG, the whole
block is buried under #ifdef CONFIG_SOCK_CGROUP_DATA.

The two configs are orthogonal.  If CONFIG_MEMCG is enabled without
CONFIG_SOCK_CGROUP_DATA, the subflow's memory usage is not charged
correctly.

Let's move the code out of the wrong ifdef guard.

Note that sk->sk_memcg is freed in sk_prot_free() and the parent
sk holds the refcnt of memcg->css here, so we don't need to use
css_tryget().

Fixes: 3764b0c5651e3 ("mptcp: attach subflow socket to parent cgroup")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/memcontrol.h |  6 ++++++
 mm/memcontrol.c            | 10 ++++++++++
 net/mptcp/subflow.c        | 11 +++--------
 3 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 785173aa0739..25921fbec685 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1604,6 +1604,7 @@ extern struct static_key_false memcg_sockets_enabled_key;
 #define mem_cgroup_sockets_enabled static_branch_unlikely(&memcg_sockets_enabled_key)
 void mem_cgroup_sk_alloc(struct sock *sk);
 void mem_cgroup_sk_free(struct sock *sk);
+void mem_cgroup_sk_inherit(const struct sock *sk, struct sock *newsk);
 
 #if BITS_PER_LONG < 64
 static inline void mem_cgroup_set_socket_pressure(struct mem_cgroup *memcg)
@@ -1661,6 +1662,11 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg);
 #define mem_cgroup_sockets_enabled 0
 static inline void mem_cgroup_sk_alloc(struct sock *sk) { };
 static inline void mem_cgroup_sk_free(struct sock *sk) { };
+
+static inline void mem_cgroup_sk_inherit(const struct sock *sk, struct sock *newsk)
+{
+}
+
 static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 {
 	return false;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8dd7fbed5a94..08c6e06750ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5024,6 +5024,16 @@ void mem_cgroup_sk_free(struct sock *sk)
 		css_put(&sk->sk_memcg->css);
 }
 
+void mem_cgroup_sk_inherit(const struct sock *sk, struct sock *newsk)
+{
+	if (sk->sk_memcg == newsk->sk_memcg)
+		return;
+
+	mem_cgroup_sk_free(newsk);
+	css_get(&sk->sk_memcg->css);
+	newsk->sk_memcg = sk->sk_memcg;
+}
+
 /**
  * mem_cgroup_charge_skmem - charge socket memory
  * @memcg: memcg to charge
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 3f1b62a9fe88..6fb635a95baf 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1717,19 +1717,14 @@ static void mptcp_attach_cgroup(struct sock *parent, struct sock *child)
 	/* only the additional subflows created by kworkers have to be modified */
 	if (cgroup_id(sock_cgroup_ptr(parent_skcd)) !=
 	    cgroup_id(sock_cgroup_ptr(child_skcd))) {
-#ifdef CONFIG_MEMCG
-		struct mem_cgroup *memcg = parent->sk_memcg;
-
-		mem_cgroup_sk_free(child);
-		if (memcg && css_tryget(&memcg->css))
-			child->sk_memcg = memcg;
-#endif /* CONFIG_MEMCG */
-
 		cgroup_sk_free(child_skcd);
 		*child_skcd = *parent_skcd;
 		cgroup_sk_clone(child_skcd);
 	}
 #endif /* CONFIG_SOCK_CGROUP_DATA */
+
+	if (mem_cgroup_sockets_enabled && parent->sk_memcg)
+		mem_cgroup_sk_inherit(parent, child);
 }
 
 static void mptcp_subflow_ops_override(struct sock *ssk)
-- 
2.51.0.rc0.205.g4a044479a3-goog



* [PATCH v3 net-next 02/12] mptcp: Use tcp_under_memory_pressure() in mptcp_epollin_ready().
  2025-08-12 17:58 [PATCH v3 net-next 00/12] net-memcg: Decouple controlled memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
  2025-08-12 17:58 ` [PATCH v3 net-next 01/12] mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n Kuniyuki Iwashima
@ 2025-08-12 17:58 ` Kuniyuki Iwashima
  2025-08-13  8:54   ` Matthieu Baerts
  2025-08-12 17:58 ` [PATCH v3 net-next 03/12] tcp: Simplify error path in inet_csk_accept() Kuniyuki Iwashima
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-12 17:58 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Andrew Morton, Michal Koutný, Tejun Heo
  Cc: Simon Horman, Geliang Tang, Muchun Song, Mina Almasry,
	Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, mptcp, cgroups,
	linux-mm

Some conditions used in mptcp_epollin_ready() are the same as
tcp_under_memory_pressure().

We will modify tcp_under_memory_pressure() in a later patch.

Let's use tcp_under_memory_pressure() instead.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
 net/mptcp/protocol.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index b15d7fab5c4b..a1787a1344ac 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -788,9 +788,7 @@ static inline bool mptcp_epollin_ready(const struct sock *sk)
 	 * as it can always coalesce them
 	 */
 	return (data_avail >= sk->sk_rcvlowat) ||
-	       (mem_cgroup_sockets_enabled && sk->sk_memcg &&
-		mem_cgroup_under_socket_pressure(sk->sk_memcg)) ||
-	       READ_ONCE(tcp_memory_pressure);
+		tcp_under_memory_pressure(sk);
 }
 
 int mptcp_set_rcvlowat(struct sock *sk, int val);
-- 
2.51.0.rc0.205.g4a044479a3-goog



* [PATCH v3 net-next 03/12] tcp: Simplify error path in inet_csk_accept().
  2025-08-12 17:58 [PATCH v3 net-next 00/12] net-memcg: Decouple controlled memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
  2025-08-12 17:58 ` [PATCH v3 net-next 01/12] mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n Kuniyuki Iwashima
  2025-08-12 17:58 ` [PATCH v3 net-next 02/12] mptcp: Use tcp_under_memory_pressure() in mptcp_epollin_ready() Kuniyuki Iwashima
@ 2025-08-12 17:58 ` Kuniyuki Iwashima
  2025-08-12 17:58 ` [PATCH v3 net-next 04/12] net: Call trace_sock_exceed_buf_limit() for memcg failure with SK_MEM_RECV Kuniyuki Iwashima
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-12 17:58 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Andrew Morton, Michal Koutný, Tejun Heo
  Cc: Simon Horman, Geliang Tang, Muchun Song, Mina Almasry,
	Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, mptcp, cgroups,
	linux-mm

When an error occurs in inet_csk_accept(), all we need to do is
call release_sock() and set the error in arg->err.

However, the error path jumps to another label, which introduces
unnecessary initialisation of and tests for newsk.

Let's simplify the error path and remove the redundant NULL
checks for newsk.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/inet_connection_sock.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 1e2df51427fe..724bd9ed6cd4 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -706,9 +706,9 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
 		spin_unlock_bh(&queue->fastopenq.lock);
 	}
 
-out:
 	release_sock(sk);
-	if (newsk && mem_cgroup_sockets_enabled) {
+
+	if (mem_cgroup_sockets_enabled) {
 		gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
 		int amt = 0;
 
@@ -732,18 +732,17 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
 
 		release_sock(newsk);
 	}
+
 	if (req)
 		reqsk_put(req);
 
-	if (newsk)
-		inet_init_csk_locks(newsk);
-
+	inet_init_csk_locks(newsk);
 	return newsk;
+
 out_err:
-	newsk = NULL;
-	req = NULL;
+	release_sock(sk);
 	arg->err = error;
-	goto out;
+	return NULL;
 }
 EXPORT_SYMBOL(inet_csk_accept);
 
-- 
2.51.0.rc0.205.g4a044479a3-goog



* [PATCH v3 net-next 04/12] net: Call trace_sock_exceed_buf_limit() for memcg failure with SK_MEM_RECV.
  2025-08-12 17:58 [PATCH v3 net-next 00/12] net-memcg: Decouple controlled memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
                   ` (2 preceding siblings ...)
  2025-08-12 17:58 ` [PATCH v3 net-next 03/12] tcp: Simplify error path in inet_csk_accept() Kuniyuki Iwashima
@ 2025-08-12 17:58 ` Kuniyuki Iwashima
  2025-08-12 17:58 ` [PATCH v3 net-next 05/12] net: Clean up __sk_mem_raise_allocated() Kuniyuki Iwashima
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-12 17:58 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Andrew Morton, Michal Koutný, Tejun Heo
  Cc: Simon Horman, Geliang Tang, Muchun Song, Mina Almasry,
	Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, mptcp, cgroups,
	linux-mm

Initially, trace_sock_exceed_buf_limit() was invoked when
__sk_mem_raise_allocated() failed due to the memcg limit or the
global limit.

However, commit d6f19938eb031 ("net: expose sk wmem in
sock_exceed_buf_limit tracepoint") somehow suppressed the event
only when memcg failed to charge for SK_MEM_RECV, although the
memcg failure for SK_MEM_SEND still triggers the event.

Let's restore the event for SK_MEM_RECV.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
 net/core/sock.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 7c26ec8dce63..380bc1aa6982 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3354,8 +3354,7 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 		}
 	}
 
-	if (kind == SK_MEM_SEND || (kind == SK_MEM_RECV && charged))
-		trace_sock_exceed_buf_limit(sk, prot, allocated, kind);
+	trace_sock_exceed_buf_limit(sk, prot, allocated, kind);
 
 	sk_memory_allocated_sub(sk, amt);
 
-- 
2.51.0.rc0.205.g4a044479a3-goog



* [PATCH v3 net-next 05/12] net: Clean up __sk_mem_raise_allocated().
  2025-08-12 17:58 [PATCH v3 net-next 00/12] net-memcg: Decouple controlled memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
                   ` (3 preceding siblings ...)
  2025-08-12 17:58 ` [PATCH v3 net-next 04/12] net: Call trace_sock_exceed_buf_limit() for memcg failure with SK_MEM_RECV Kuniyuki Iwashima
@ 2025-08-12 17:58 ` Kuniyuki Iwashima
  2025-08-12 17:58 ` [PATCH v3 net-next 06/12] net-memcg: Introduce mem_cgroup_from_sk() Kuniyuki Iwashima
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-12 17:58 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Andrew Morton, Michal Koutný, Tejun Heo
  Cc: Simon Horman, Geliang Tang, Muchun Song, Mina Almasry,
	Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, mptcp, cgroups,
	linux-mm

In __sk_mem_raise_allocated(), charged is initialised to true because
of the odd condition removed in the previous patch.

This makes the variable unreliable on its own, so we have to check
another variable, memcg, first.

Also, we will later factorise the common memcg check below:

    if (mem_cgroup_sockets_enabled && sk->sk_memcg)

As a prep, let's initialise charged to false and memcg to NULL.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
 net/core/sock.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 380bc1aa6982..000940ecf360 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3263,15 +3263,16 @@ EXPORT_SYMBOL(sk_wait_data);
  */
 int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
-	struct mem_cgroup *memcg = mem_cgroup_sockets_enabled ? sk->sk_memcg : NULL;
 	struct proto *prot = sk->sk_prot;
-	bool charged = true;
+	struct mem_cgroup *memcg = NULL;
+	bool charged = false;
 	long allocated;
 
 	sk_memory_allocated_add(sk, amt);
 	allocated = sk_memory_allocated(sk);
 
-	if (memcg) {
+	if (mem_cgroup_sockets_enabled && sk->sk_memcg) {
+		memcg = sk->sk_memcg;
 		charged = mem_cgroup_charge_skmem(memcg, amt, gfp_memcg_charge());
 		if (!charged)
 			goto suppress_allocation;
@@ -3358,7 +3359,7 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 
 	sk_memory_allocated_sub(sk, amt);
 
-	if (memcg && charged)
+	if (charged)
 		mem_cgroup_uncharge_skmem(memcg, amt);
 
 	return 0;
-- 
2.51.0.rc0.205.g4a044479a3-goog



* [PATCH v3 net-next 06/12] net-memcg: Introduce mem_cgroup_from_sk().
  2025-08-12 17:58 [PATCH v3 net-next 00/12] net-memcg: Decouple controlled memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
                   ` (4 preceding siblings ...)
  2025-08-12 17:58 ` [PATCH v3 net-next 05/12] net: Clean up __sk_mem_raise_allocated() Kuniyuki Iwashima
@ 2025-08-12 17:58 ` Kuniyuki Iwashima
  2025-08-13  1:44   ` Roman Gushchin
  2025-08-12 17:58 ` [PATCH v3 net-next 07/12] net-memcg: Introduce mem_cgroup_sk_enabled() Kuniyuki Iwashima
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-12 17:58 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Andrew Morton, Michal Koutný, Tejun Heo
  Cc: Simon Horman, Geliang Tang, Muchun Song, Mina Almasry,
	Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, mptcp, cgroups,
	linux-mm

We will store a flag in the lowest bit of sk->sk_memcg.

Once we do so, directly dereferencing sk->sk_memcg will no longer be
valid, and we do not want raw accesses to sk->sk_memcg scattered
across many places.

Let's introduce mem_cgroup_from_sk().

Other places accessing the raw sk->sk_memcg will be converted later.

Note that we cannot define the helper as an inline function in
memcontrol.h because we cannot access any fields of struct sock there
due to a circular dependency.  Thus, it is placed in sock.h.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
 include/net/sock.h              | 12 ++++++++++++
 mm/memcontrol.c                 | 14 +++++++++-----
 net/ipv4/inet_connection_sock.c |  2 +-
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index c8a4b283df6f..811f95ea8d00 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2594,6 +2594,18 @@ static inline gfp_t gfp_memcg_charge(void)
 	return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
 }
 
+#ifdef CONFIG_MEMCG
+static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
+{
+	return sk->sk_memcg;
+}
+#else
+static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
+{
+	return NULL;
+}
+#endif
+
 static inline long sock_rcvtimeo(const struct sock *sk, bool noblock)
 {
 	return noblock ? 0 : READ_ONCE(sk->sk_rcvtimeo);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 08c6e06750ac..2db7df32fd7c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5020,18 +5020,22 @@ void mem_cgroup_sk_alloc(struct sock *sk)
 
 void mem_cgroup_sk_free(struct sock *sk)
 {
-	if (sk->sk_memcg)
-		css_put(&sk->sk_memcg->css);
+	struct mem_cgroup *memcg = mem_cgroup_from_sk(sk);
+
+	if (memcg)
+		css_put(&memcg->css);
 }
 
 void mem_cgroup_sk_inherit(const struct sock *sk, struct sock *newsk)
 {
-	if (sk->sk_memcg == newsk->sk_memcg)
+	struct mem_cgroup *memcg = mem_cgroup_from_sk(sk);
+
+	if (memcg == mem_cgroup_from_sk(newsk))
 		return;
 
 	mem_cgroup_sk_free(newsk);
-	css_get(&sk->sk_memcg->css);
-	newsk->sk_memcg = sk->sk_memcg;
+	css_get(&memcg->css);
+	newsk->sk_memcg = memcg;
 }
 
 /**
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 724bd9ed6cd4..93569bbe00f4 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -718,7 +718,7 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
 		lock_sock(newsk);
 
 		mem_cgroup_sk_alloc(newsk);
-		if (newsk->sk_memcg) {
+		if (mem_cgroup_from_sk(newsk)) {
 			/* The socket has not been accepted yet, no need
 			 * to look at newsk->sk_wmem_queued.
 			 */
-- 
2.51.0.rc0.205.g4a044479a3-goog



* [PATCH v3 net-next 07/12] net-memcg: Introduce mem_cgroup_sk_enabled().
  2025-08-12 17:58 [PATCH v3 net-next 00/12] net-memcg: Decouple controlled memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
                   ` (5 preceding siblings ...)
  2025-08-12 17:58 ` [PATCH v3 net-next 06/12] net-memcg: Introduce mem_cgroup_from_sk() Kuniyuki Iwashima
@ 2025-08-12 17:58 ` Kuniyuki Iwashima
  2025-08-13  1:46   ` Roman Gushchin
  2025-08-12 17:58 ` [PATCH v3 net-next 08/12] net-memcg: Pass struct sock to mem_cgroup_sk_(un)?charge() Kuniyuki Iwashima
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-12 17:58 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Andrew Morton, Michal Koutný, Tejun Heo
  Cc: Simon Horman, Geliang Tang, Muchun Song, Mina Almasry,
	Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, mptcp, cgroups,
	linux-mm

The socket memcg feature is enabled by a static key and
only works for sockets in non-root cgroups.

We check both conditions in many places.

Let's factorise it as a helper function.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
 include/net/proto_memory.h |  2 +-
 include/net/sock.h         | 10 ++++++++++
 include/net/tcp.h          |  2 +-
 net/core/sock.c            |  6 +++---
 net/ipv4/tcp_output.c      |  2 +-
 net/mptcp/subflow.c        |  2 +-
 6 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/include/net/proto_memory.h b/include/net/proto_memory.h
index a6ab2f4f5e28..859e63de81c4 100644
--- a/include/net/proto_memory.h
+++ b/include/net/proto_memory.h
@@ -31,7 +31,7 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
 	if (!sk->sk_prot->memory_pressure)
 		return false;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
+	if (mem_cgroup_sk_enabled(sk) &&
 	    mem_cgroup_under_socket_pressure(sk->sk_memcg))
 		return true;
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 811f95ea8d00..3efdf680401d 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2599,11 +2599,21 @@ static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
 {
 	return sk->sk_memcg;
 }
+
+static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
+{
+	return mem_cgroup_sockets_enabled && mem_cgroup_from_sk(sk);
+}
 #else
 static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
 {
 	return NULL;
 }
+
+static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
+{
+	return false;
+}
 #endif
 
 static inline long sock_rcvtimeo(const struct sock *sk, bool noblock)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 526a26e7a150..9f01b6be6444 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -275,7 +275,7 @@ extern unsigned long tcp_memory_pressure;
 /* optimized version of sk_under_memory_pressure() for TCP sockets */
 static inline bool tcp_under_memory_pressure(const struct sock *sk)
 {
-	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
+	if (mem_cgroup_sk_enabled(sk) &&
 	    mem_cgroup_under_socket_pressure(sk->sk_memcg))
 		return true;
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 000940ecf360..ab658fe23e1e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1032,7 +1032,7 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
 	bool charged;
 	int pages;
 
-	if (!mem_cgroup_sockets_enabled || !sk->sk_memcg || !sk_has_account(sk))
+	if (!mem_cgroup_sk_enabled(sk) || !sk_has_account(sk))
 		return -EOPNOTSUPP;
 
 	if (!bytes)
@@ -3271,7 +3271,7 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 	sk_memory_allocated_add(sk, amt);
 	allocated = sk_memory_allocated(sk);
 
-	if (mem_cgroup_sockets_enabled && sk->sk_memcg) {
+	if (mem_cgroup_sk_enabled(sk)) {
 		memcg = sk->sk_memcg;
 		charged = mem_cgroup_charge_skmem(memcg, amt, gfp_memcg_charge());
 		if (!charged)
@@ -3398,7 +3398,7 @@ void __sk_mem_reduce_allocated(struct sock *sk, int amount)
 {
 	sk_memory_allocated_sub(sk, amount);
 
-	if (mem_cgroup_sockets_enabled && sk->sk_memcg)
+	if (mem_cgroup_sk_enabled(sk))
 		mem_cgroup_uncharge_skmem(sk->sk_memcg, amount);
 
 	if (sk_under_global_memory_pressure(sk) &&
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index caf11920a878..37fb320e6f70 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3578,7 +3578,7 @@ void sk_forced_mem_schedule(struct sock *sk, int size)
 	sk_forward_alloc_add(sk, amt << PAGE_SHIFT);
 	sk_memory_allocated_add(sk, amt);
 
-	if (mem_cgroup_sockets_enabled && sk->sk_memcg)
+	if (mem_cgroup_sk_enabled(sk))
 		mem_cgroup_charge_skmem(sk->sk_memcg, amt,
 					gfp_memcg_charge() | __GFP_NOFAIL);
 }
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 6fb635a95baf..4874147e0b17 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1723,7 +1723,7 @@ static void mptcp_attach_cgroup(struct sock *parent, struct sock *child)
 	}
 #endif /* CONFIG_SOCK_CGROUP_DATA */
 
-	if (mem_cgroup_sockets_enabled && parent->sk_memcg)
+	if (mem_cgroup_sk_enabled(parent))
 		mem_cgroup_sk_inherit(parent, child);
 }
 
-- 
2.51.0.rc0.205.g4a044479a3-goog



* [PATCH v3 net-next 08/12] net-memcg: Pass struct sock to mem_cgroup_sk_(un)?charge().
  2025-08-12 17:58 [PATCH v3 net-next 00/12] net-memcg: Decouple controlled memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
                   ` (6 preceding siblings ...)
  2025-08-12 17:58 ` [PATCH v3 net-next 07/12] net-memcg: Introduce mem_cgroup_sk_enabled() Kuniyuki Iwashima
@ 2025-08-12 17:58 ` Kuniyuki Iwashima
  2025-08-13  1:47   ` Roman Gushchin
  2025-08-12 17:58 ` [PATCH v3 net-next 09/12] net-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure() Kuniyuki Iwashima
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-12 17:58 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Andrew Morton, Michal Koutný, Tejun Heo
  Cc: Simon Horman, Geliang Tang, Muchun Song, Mina Almasry,
	Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, mptcp, cgroups,
	linux-mm

We will store a flag in the lowest bit of sk->sk_memcg.

Then, we cannot pass the raw pointer to mem_cgroup_charge_skmem()
and mem_cgroup_uncharge_skmem().

Let's pass struct sock to the functions.

While at it, they are renamed to match other functions starting
with mem_cgroup_sk_.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/memcontrol.h      | 29 ++++++++++++++++++++++++-----
 mm/memcontrol.c                 | 18 +++++++++++-------
 net/core/sock.c                 | 24 +++++++++++-------------
 net/ipv4/inet_connection_sock.c |  2 +-
 net/ipv4/tcp_output.c           |  3 +--
 5 files changed, 48 insertions(+), 28 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 25921fbec685..0837d3de3a68 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1596,15 +1596,16 @@ static inline void mem_cgroup_flush_foreign(struct bdi_writeback *wb)
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 struct sock;
-bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
-			     gfp_t gfp_mask);
-void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
 #ifdef CONFIG_MEMCG
 extern struct static_key_false memcg_sockets_enabled_key;
 #define mem_cgroup_sockets_enabled static_branch_unlikely(&memcg_sockets_enabled_key)
+
 void mem_cgroup_sk_alloc(struct sock *sk);
 void mem_cgroup_sk_free(struct sock *sk);
 void mem_cgroup_sk_inherit(const struct sock *sk, struct sock *newsk);
+bool mem_cgroup_sk_charge(const struct sock *sk, unsigned int nr_pages,
+			  gfp_t gfp_mask);
+void mem_cgroup_sk_uncharge(const struct sock *sk, unsigned int nr_pages);
 
 #if BITS_PER_LONG < 64
 static inline void mem_cgroup_set_socket_pressure(struct mem_cgroup *memcg)
@@ -1660,13 +1661,31 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id);
 void reparent_shrinker_deferred(struct mem_cgroup *memcg);
 #else
 #define mem_cgroup_sockets_enabled 0
-static inline void mem_cgroup_sk_alloc(struct sock *sk) { };
-static inline void mem_cgroup_sk_free(struct sock *sk) { };
+
+static inline void mem_cgroup_sk_alloc(struct sock *sk)
+{
+}
+
+static inline void mem_cgroup_sk_free(struct sock *sk)
+{
+}
 
 static inline void mem_cgroup_sk_inherit(const struct sock *sk, struct sock *newsk)
 {
 }
 
+static inline bool mem_cgroup_sk_charge(const struct sock *sk,
+					unsigned int nr_pages,
+					gfp_t gfp_mask)
+{
+	return false;
+}
+
+static inline void mem_cgroup_sk_uncharge(const struct sock *sk,
+					  unsigned int nr_pages)
+{
+}
+
 static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 {
 	return false;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2db7df32fd7c..d32b7a547f42 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5039,17 +5039,19 @@ void mem_cgroup_sk_inherit(const struct sock *sk, struct sock *newsk)
 }
 
 /**
- * mem_cgroup_charge_skmem - charge socket memory
- * @memcg: memcg to charge
+ * mem_cgroup_sk_charge - charge socket memory
+ * @sk: socket in memcg to charge
  * @nr_pages: number of pages to charge
  * @gfp_mask: reclaim mode
  *
  * Charges @nr_pages to @memcg. Returns %true if the charge fit within
  * @memcg's configured limit, %false if it doesn't.
  */
-bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
-			     gfp_t gfp_mask)
+bool mem_cgroup_sk_charge(const struct sock *sk, unsigned int nr_pages,
+			  gfp_t gfp_mask)
 {
+	struct mem_cgroup *memcg = mem_cgroup_from_sk(sk);
+
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return memcg1_charge_skmem(memcg, nr_pages, gfp_mask);
 
@@ -5062,12 +5064,14 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
 }
 
 /**
- * mem_cgroup_uncharge_skmem - uncharge socket memory
- * @memcg: memcg to uncharge
+ * mem_cgroup_sk_uncharge - uncharge socket memory
+ * @sk: socket in memcg to uncharge
  * @nr_pages: number of pages to uncharge
  */
-void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
+void mem_cgroup_sk_uncharge(const struct sock *sk, unsigned int nr_pages)
 {
+	struct mem_cgroup *memcg = mem_cgroup_from_sk(sk);
+
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
 		memcg1_uncharge_skmem(memcg, nr_pages);
 		return;
diff --git a/net/core/sock.c b/net/core/sock.c
index ab658fe23e1e..5537ca263858 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1041,8 +1041,8 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
 	pages = sk_mem_pages(bytes);
 
 	/* pre-charge to memcg */
-	charged = mem_cgroup_charge_skmem(sk->sk_memcg, pages,
-					  GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	charged = mem_cgroup_sk_charge(sk, pages,
+				       GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 	if (!charged)
 		return -ENOMEM;
 
@@ -1054,7 +1054,7 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
 	 */
 	if (allocated > sk_prot_mem_limits(sk, 1)) {
 		sk_memory_allocated_sub(sk, pages);
-		mem_cgroup_uncharge_skmem(sk->sk_memcg, pages);
+		mem_cgroup_sk_uncharge(sk, pages);
 		return -ENOMEM;
 	}
 	sk_forward_alloc_add(sk, pages << PAGE_SHIFT);
@@ -3263,17 +3263,16 @@ EXPORT_SYMBOL(sk_wait_data);
  */
 int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
+	bool memcg_enabled = false, charged = false;
 	struct proto *prot = sk->sk_prot;
-	struct mem_cgroup *memcg = NULL;
-	bool charged = false;
 	long allocated;
 
 	sk_memory_allocated_add(sk, amt);
 	allocated = sk_memory_allocated(sk);
 
 	if (mem_cgroup_sk_enabled(sk)) {
-		memcg = sk->sk_memcg;
-		charged = mem_cgroup_charge_skmem(memcg, amt, gfp_memcg_charge());
+		memcg_enabled = true;
+		charged = mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge());
 		if (!charged)
 			goto suppress_allocation;
 	}
@@ -3347,10 +3346,9 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 		 */
 		if (sk->sk_wmem_queued + size >= sk->sk_sndbuf) {
 			/* Force charge with __GFP_NOFAIL */
-			if (memcg && !charged) {
-				mem_cgroup_charge_skmem(memcg, amt,
-					gfp_memcg_charge() | __GFP_NOFAIL);
-			}
+			if (memcg_enabled && !charged)
+				mem_cgroup_sk_charge(sk, amt,
+						     gfp_memcg_charge() | __GFP_NOFAIL);
 			return 1;
 		}
 	}
@@ -3360,7 +3358,7 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 	sk_memory_allocated_sub(sk, amt);
 
 	if (charged)
-		mem_cgroup_uncharge_skmem(memcg, amt);
+		mem_cgroup_sk_uncharge(sk, amt);
 
 	return 0;
 }
@@ -3399,7 +3397,7 @@ void __sk_mem_reduce_allocated(struct sock *sk, int amount)
 	sk_memory_allocated_sub(sk, amount);
 
 	if (mem_cgroup_sk_enabled(sk))
-		mem_cgroup_uncharge_skmem(sk->sk_memcg, amount);
+		mem_cgroup_sk_uncharge(sk, amount);
 
 	if (sk_under_global_memory_pressure(sk) &&
 	    (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 93569bbe00f4..0ef1eacd539d 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -727,7 +727,7 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
 		}
 
 		if (amt)
-			mem_cgroup_charge_skmem(newsk->sk_memcg, amt, gfp);
+			mem_cgroup_sk_charge(newsk, amt, gfp);
 		kmem_cache_charge(newsk, gfp);
 
 		release_sock(newsk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 37fb320e6f70..dfbac0876d96 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3579,8 +3579,7 @@ void sk_forced_mem_schedule(struct sock *sk, int size)
 	sk_memory_allocated_add(sk, amt);
 
 	if (mem_cgroup_sk_enabled(sk))
-		mem_cgroup_charge_skmem(sk->sk_memcg, amt,
-					gfp_memcg_charge() | __GFP_NOFAIL);
+		mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge() | __GFP_NOFAIL);
 }
 
 /* Send a FIN. The caller locks the socket for us.
-- 
2.51.0.rc0.205.g4a044479a3-goog



* [PATCH v3 net-next 09/12] net-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure().
  2025-08-12 17:58 [PATCH v3 net-next 00/12] net-memcg: Decouple controlled memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
                   ` (7 preceding siblings ...)
  2025-08-12 17:58 ` [PATCH v3 net-next 08/12] net-memcg: Pass struct sock to mem_cgroup_sk_(un)?charge() Kuniyuki Iwashima
@ 2025-08-12 17:58 ` Kuniyuki Iwashima
  2025-08-13  1:49   ` Roman Gushchin
  2025-08-13  1:49   ` Roman Gushchin
  2025-08-12 17:58 ` [PATCH v3 net-next 10/12] net: Define sk_memcg under CONFIG_MEMCG Kuniyuki Iwashima
                   ` (2 subsequent siblings)
  11 siblings, 2 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-12 17:58 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Andrew Morton, Michal Koutný, Tejun Heo
  Cc: Simon Horman, Geliang Tang, Muchun Song, Mina Almasry,
	Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, mptcp, cgroups,
	linux-mm

We will store a flag in the lowest bit of sk->sk_memcg.

Then, we cannot pass the raw pointer to mem_cgroup_under_socket_pressure().

Let's pass struct sock to it and rename the function to match other
functions starting with mem_cgroup_sk_.

Note that the helper is moved to sock.h to use mem_cgroup_from_sk().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/memcontrol.h | 18 ------------------
 include/net/proto_memory.h |  2 +-
 include/net/sock.h         | 22 ++++++++++++++++++++++
 include/net/tcp.h          |  2 +-
 4 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0837d3de3a68..fb27e3d2fdac 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1642,19 +1642,6 @@ static inline u64 mem_cgroup_get_socket_pressure(struct mem_cgroup *memcg)
 }
 #endif
 
-static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
-{
-#ifdef CONFIG_MEMCG_V1
-	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
-		return !!memcg->tcpmem_pressure;
-#endif /* CONFIG_MEMCG_V1 */
-	do {
-		if (time_before64(get_jiffies_64(), mem_cgroup_get_socket_pressure(memcg)))
-			return true;
-	} while ((memcg = parent_mem_cgroup(memcg)));
-	return false;
-}
-
 int alloc_shrinker_info(struct mem_cgroup *memcg);
 void free_shrinker_info(struct mem_cgroup *memcg);
 void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id);
@@ -1686,11 +1673,6 @@ static inline void mem_cgroup_sk_uncharge(const struct sock *sk,
 {
 }
 
-static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
-{
-	return false;
-}
-
 static inline void set_shrinker_bit(struct mem_cgroup *memcg,
 				    int nid, int shrinker_id)
 {
diff --git a/include/net/proto_memory.h b/include/net/proto_memory.h
index 859e63de81c4..8e91a8fa31b5 100644
--- a/include/net/proto_memory.h
+++ b/include/net/proto_memory.h
@@ -32,7 +32,7 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
 		return false;
 
 	if (mem_cgroup_sk_enabled(sk) &&
-	    mem_cgroup_under_socket_pressure(sk->sk_memcg))
+	    mem_cgroup_sk_under_memory_pressure(sk))
 		return true;
 
 	return !!READ_ONCE(*sk->sk_prot->memory_pressure);
diff --git a/include/net/sock.h b/include/net/sock.h
index 3efdf680401d..3bc4d566f7d0 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2604,6 +2604,23 @@ static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
 {
 	return mem_cgroup_sockets_enabled && mem_cgroup_from_sk(sk);
 }
+
+static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_sk(sk);
+
+#ifdef CONFIG_MEMCG_V1
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return !!memcg->tcpmem_pressure;
+#endif /* CONFIG_MEMCG_V1 */
+
+	do {
+		if (time_before64(get_jiffies_64(), mem_cgroup_get_socket_pressure(memcg)))
+			return true;
+	} while ((memcg = parent_mem_cgroup(memcg)));
+
+	return false;
+}
 #else
 static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
 {
@@ -2614,6 +2631,11 @@ static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
 {
 	return false;
 }
+
+static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
+{
+	return false;
+}
 #endif
 
 static inline long sock_rcvtimeo(const struct sock *sk, bool noblock)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9f01b6be6444..2936b8175950 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -276,7 +276,7 @@ extern unsigned long tcp_memory_pressure;
 static inline bool tcp_under_memory_pressure(const struct sock *sk)
 {
 	if (mem_cgroup_sk_enabled(sk) &&
-	    mem_cgroup_under_socket_pressure(sk->sk_memcg))
+	    mem_cgroup_sk_under_memory_pressure(sk))
 		return true;
 
 	return READ_ONCE(tcp_memory_pressure);
-- 
2.51.0.rc0.205.g4a044479a3-goog



* [PATCH v3 net-next 10/12] net: Define sk_memcg under CONFIG_MEMCG.
  2025-08-12 17:58 [PATCH v3 net-next 00/12] net-memcg: Decouple controlled memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
                   ` (8 preceding siblings ...)
  2025-08-12 17:58 ` [PATCH v3 net-next 09/12] net-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure() Kuniyuki Iwashima
@ 2025-08-12 17:58 ` Kuniyuki Iwashima
  2025-08-12 17:58 ` [PATCH v3 net-next 11/12] net-memcg: Store MEMCG_SOCK_ISOLATED in sk->sk_memcg Kuniyuki Iwashima
  2025-08-12 17:58 ` [PATCH v3 net-next 12/12] net-memcg: Decouple controlled memcg from global protocol memory accounting Kuniyuki Iwashima
  11 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-12 17:58 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Andrew Morton, Michal Koutný, Tejun Heo
  Cc: Simon Horman, Geliang Tang, Muchun Song, Mina Almasry,
	Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, mptcp, cgroups,
	linux-mm

Except for sk_clone_lock(), all accesses to sk->sk_memcg
are done under CONFIG_MEMCG.

Let's define sk->sk_memcg under CONFIG_MEMCG so that the field
disappears from struct sock when memcg is compiled out.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
 include/net/sock.h | 2 ++
 net/core/sock.c    | 4 ++++
 2 files changed, 6 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 3bc4d566f7d0..1c49ea13af4a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -443,7 +443,9 @@ struct sock {
 	__cacheline_group_begin(sock_read_rxtx);
 	int			sk_err;
 	struct socket		*sk_socket;
+#ifdef CONFIG_MEMCG
 	struct mem_cgroup	*sk_memcg;
+#endif
 #ifdef CONFIG_XFRM
 	struct xfrm_policy __rcu *sk_policy[2];
 #endif
diff --git a/net/core/sock.c b/net/core/sock.c
index 5537ca263858..ab6953d295df 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2512,8 +2512,10 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 
 	sock_reset_flag(newsk, SOCK_DONE);
 
+#ifdef CONFIG_MEMCG
 	/* sk->sk_memcg will be populated at accept() time */
 	newsk->sk_memcg = NULL;
+#endif
 
 	cgroup_sk_clone(&newsk->sk_cgrp_data);
 
@@ -4452,7 +4454,9 @@ static int __init sock_struct_check(void)
 
 	CACHELINE_ASSERT_GROUP_MEMBER(struct sock, sock_read_rxtx, sk_err);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct sock, sock_read_rxtx, sk_socket);
+#ifdef CONFIG_MEMCG
 	CACHELINE_ASSERT_GROUP_MEMBER(struct sock, sock_read_rxtx, sk_memcg);
+#endif
 
 	CACHELINE_ASSERT_GROUP_MEMBER(struct sock, sock_write_rxtx, sk_lock);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct sock, sock_write_rxtx, sk_reserved_mem);
-- 
2.51.0.rc0.205.g4a044479a3-goog



* [PATCH v3 net-next 11/12] net-memcg: Store MEMCG_SOCK_ISOLATED in sk->sk_memcg.
  2025-08-12 17:58 [PATCH v3 net-next 00/12] net-memcg: Decouple controlled memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
                   ` (9 preceding siblings ...)
  2025-08-12 17:58 ` [PATCH v3 net-next 10/12] net: Define sk_memcg under CONFIG_MEMCG Kuniyuki Iwashima
@ 2025-08-12 17:58 ` Kuniyuki Iwashima
  2025-08-12 17:58 ` [PATCH v3 net-next 12/12] net-memcg: Decouple controlled memcg from global protocol memory accounting Kuniyuki Iwashima
  11 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-12 17:58 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Andrew Morton, Michal Koutný, Tejun Heo
  Cc: Simon Horman, Geliang Tang, Muchun Song, Mina Almasry,
	Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, mptcp, cgroups,
	linux-mm

We will decouple sockets from the global protocol memory accounting
if the cgroup's memory.max is not "max" (PAGE_COUNTER_MAX).

memory.max can change at any time, so we must snapshot the state
for each socket to ensure consistency.
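
For example, a socket whose sk->sk_memcg is set while memory.max is
"max" keeps using the global protocol accounting even if memory.max
is lowered later; only sockets created after the change see the new
state.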

Given sk->sk_memcg can be accessed in the fast path, it would
be preferable to place the flag field in the same cache line as
sk->sk_memcg.

However, struct sock does not have such a 1-byte hole.

Let's store the flag in the lowest bit of sk->sk_memcg and add
a helper to check the bit.  This is safe because struct mem_cgroup
is at least word-aligned, so the lowest bit of a valid pointer is
always zero.

In the next patch, if mem_cgroup_sk_isolated() returns true,
the socket will not be charged to sk->sk_prot->memory_allocated.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v2:
  * Set MEMCG_SOCK_ISOLATED based on memory.max instead of
    a dedicated knob
---
 include/net/sock.h | 23 ++++++++++++++++++++++-
 mm/memcontrol.c    | 14 ++++++++++++--
 2 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 1c49ea13af4a..29ba5fdaafd6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2597,9 +2597,18 @@ static inline gfp_t gfp_memcg_charge(void)
 }
 
 #ifdef CONFIG_MEMCG
+
+#define MEMCG_SOCK_ISOLATED	1UL
+#define MEMCG_SOCK_FLAG_MASK	MEMCG_SOCK_ISOLATED
+#define MEMCG_SOCK_PTR_MASK	~(MEMCG_SOCK_FLAG_MASK)
+
 static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
 {
-	return sk->sk_memcg;
+	unsigned long val = (unsigned long)sk->sk_memcg;
+
+	val &= MEMCG_SOCK_PTR_MASK;
+
+	return (struct mem_cgroup *)val;
 }
 
 static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
@@ -2607,6 +2616,13 @@ static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
 	return mem_cgroup_sockets_enabled && mem_cgroup_from_sk(sk);
 }
 
+static inline bool mem_cgroup_sk_isolated(const struct sock *sk)
+{
+	struct mem_cgroup *memcg = sk->sk_memcg;
+
+	return (unsigned long)memcg & MEMCG_SOCK_ISOLATED;
+}
+
 static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_sk(sk);
@@ -2634,6 +2650,11 @@ static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
 	return false;
 }
 
+static inline bool mem_cgroup_sk_isolated(const struct sock *sk)
+{
+	return false;
+}
+
 static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
 {
 	return false;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d32b7a547f42..cb5b8a9d21db 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4995,6 +4995,16 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
 DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
 EXPORT_SYMBOL(memcg_sockets_enabled_key);
 
+static void mem_cgroup_sk_set(struct sock *sk, const struct mem_cgroup *memcg)
+{
+	unsigned long val = (unsigned long)memcg;
+
+	if (READ_ONCE(memcg->memory.max) != PAGE_COUNTER_MAX)
+		val |= MEMCG_SOCK_ISOLATED;
+
+	sk->sk_memcg = (struct mem_cgroup *)val;
+}
+
 void mem_cgroup_sk_alloc(struct sock *sk)
 {
 	struct mem_cgroup *memcg;
@@ -5013,7 +5023,7 @@ void mem_cgroup_sk_alloc(struct sock *sk)
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !memcg1_tcpmem_active(memcg))
 		goto out;
 	if (css_tryget(&memcg->css))
-		sk->sk_memcg = memcg;
+		mem_cgroup_sk_set(sk, memcg);
 out:
 	rcu_read_unlock();
 }
@@ -5035,7 +5045,7 @@ void mem_cgroup_sk_inherit(const struct sock *sk, struct sock *newsk)
 
 	mem_cgroup_sk_free(newsk);
 	css_get(&memcg->css);
-	newsk->sk_memcg = memcg;
+	mem_cgroup_sk_set(newsk, memcg);
 }
 
 /**
-- 
2.51.0.rc0.205.g4a044479a3-goog



* [PATCH v3 net-next 12/12] net-memcg: Decouple controlled memcg from global protocol memory accounting.
  2025-08-12 17:58 [PATCH v3 net-next 00/12] net-memcg: Decouple controlled memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
                   ` (10 preceding siblings ...)
  2025-08-12 17:58 ` [PATCH v3 net-next 11/12] net-memcg: Store MEMCG_SOCK_ISOLATED in sk->sk_memcg Kuniyuki Iwashima
@ 2025-08-12 17:58 ` Kuniyuki Iwashima
  2025-08-13  1:57   ` Roman Gushchin
                     ` (2 more replies)
  11 siblings, 3 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-12 17:58 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Andrew Morton, Michal Koutný, Tejun Heo
  Cc: Simon Horman, Geliang Tang, Muchun Song, Mina Almasry,
	Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, mptcp, cgroups,
	linux-mm

Some protocols (e.g., TCP, UDP) implement memory accounting for socket
buffers and charge memory to per-protocol global counters pointed to by
sk->sk_prot->memory_allocated.

When running under a non-root cgroup, this memory is also charged to the
memcg as "sock" in memory.stat.

Even when a memcg controls memory usage, sockets of such protocols are
still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).

This makes it difficult to accurately estimate and configure appropriate
global limits, especially in multi-tenant environments.

If all workloads were guaranteed to be controlled under memcg, the issue
could be worked around by setting tcp_mem[0~2] to UINT_MAX.

In reality, this assumption does not always hold, and processes that
belong to the root cgroup or opt out of memcg can consume memory up to
the global limit, becoming a noisy neighbour.

Let's decouple memcg from the global per-protocol memory accounting if
it has a finite memory.max (!= "max").

If memory.max is "max", we still charge memory to both the memcg and
the global per-protocol counters, because TCP allows only about 10% of
the physical memory by default.

This simplifies memcg configuration while keeping the global limits
within a reasonable range.

If mem_cgroup_sk_isolated(sk) returns true, the per-protocol memory
accounting is skipped.
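
Schematically, the charge path becomes the following (a simplified
sketch; the actual patch below also keeps the suppress_allocation
fallback when the memcg charge fails):

    if (mem_cgroup_sk_enabled(sk)) {
            charged = mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge());
            if (mem_cgroup_sk_isolated(sk) && charged)
                    return 1;       /* charged to the memcg only */
    }

    /* non-isolated sockets keep the global per-protocol accounting */
    sk_memory_allocated_add(sk, amt);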

In inet_csk_accept(), we need to reclaim the amount already charged to
the global counters for child sockets, because we do not allocate
sk->sk_memcg until accept().

Note that trace_sock_exceed_buf_limit() will always report 0 as the
allocated amount for isolated sockets, but the usage can still be
obtained via memory.stat.

Tested with a script that creates local socket pairs and send()s a
bunch of data without recv()ing.

Setup:

  # mkdir /sys/fs/cgroup/test
  # echo $$ >> /sys/fs/cgroup/test/cgroup.procs
  # sysctl -q net.ipv4.tcp_mem="1000 1000 1000"

Without setting memory.max:

  # prlimit -n=524288:524288 bash -c "python3 pressure.py" &
  # cat /sys/fs/cgroup/test/memory.stat | grep sock
  sock 22642688
  # ss -tn | head -n 5
  State Recv-Q Send-Q Local Address:Port  Peer Address:Port
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53188
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:49972
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53868
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53554
  # nstat | grep Pressure || echo no pressure
  TcpExtTCPMemoryPressures        1                  0.0

With memory.max:

  # echo $((64 * 1024 ** 3)) > /sys/fs/cgroup/test/memory.max
  # prlimit -n=524288:524288 bash -c "python3 pressure.py" &
  # cat /sys/fs/cgroup/test/memory.stat | grep sock
  sock 2757468160
  # ss -tn | head -n 5
  State Recv-Q Send-Q  Local Address:Port  Peer Address:Port
  ESTAB 111000 0           127.0.0.1:36019    127.0.0.1:49026
  ESTAB 110000 0           127.0.0.1:36019    127.0.0.1:45630
  ESTAB 110000 0           127.0.0.1:36019    127.0.0.1:44870
  ESTAB 111000 0           127.0.0.1:36019    127.0.0.1:45274
  # nstat | grep Pressure || echo no pressure
  no pressure

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v3:
  * Fix build failure for kTLS

v2:
  * Add sk_should_enter_memory_pressure() for
    tcp_enter_memory_pressure() calls not in core
  * Update example in changelog
---
 include/net/proto_memory.h      | 15 ++++++--
 include/net/tcp.h               | 10 ++++--
 net/core/sock.c                 | 64 ++++++++++++++++++++++-----------
 net/ipv4/inet_connection_sock.c | 18 ++++++++--
 net/ipv4/tcp.c                  |  3 +-
 net/ipv4/tcp_output.c           | 10 ++++--
 net/mptcp/protocol.c            |  4 ++-
 net/tls/tls_device.c            |  4 ++-
 8 files changed, 94 insertions(+), 34 deletions(-)

diff --git a/include/net/proto_memory.h b/include/net/proto_memory.h
index 8e91a8fa31b5..8e8432b13515 100644
--- a/include/net/proto_memory.h
+++ b/include/net/proto_memory.h
@@ -31,13 +31,22 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
 	if (!sk->sk_prot->memory_pressure)
 		return false;
 
-	if (mem_cgroup_sk_enabled(sk) &&
-	    mem_cgroup_sk_under_memory_pressure(sk))
-		return true;
+	if (mem_cgroup_sk_enabled(sk)) {
+		if (mem_cgroup_sk_under_memory_pressure(sk))
+			return true;
+
+		if (mem_cgroup_sk_isolated(sk))
+			return false;
+	}
 
 	return !!READ_ONCE(*sk->sk_prot->memory_pressure);
 }
 
+static inline bool sk_should_enter_memory_pressure(struct sock *sk)
+{
+	return !mem_cgroup_sk_enabled(sk) || !mem_cgroup_sk_isolated(sk);
+}
+
 static inline long
 proto_memory_allocated(const struct proto *prot)
 {
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2936b8175950..0191a4585bba 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -275,9 +275,13 @@ extern unsigned long tcp_memory_pressure;
 /* optimized version of sk_under_memory_pressure() for TCP sockets */
 static inline bool tcp_under_memory_pressure(const struct sock *sk)
 {
-	if (mem_cgroup_sk_enabled(sk) &&
-	    mem_cgroup_sk_under_memory_pressure(sk))
-		return true;
+	if (mem_cgroup_sk_enabled(sk)) {
+		if (mem_cgroup_sk_under_memory_pressure(sk))
+			return true;
+
+		if (mem_cgroup_sk_isolated(sk))
+			return false;
+	}
 
 	return READ_ONCE(tcp_memory_pressure);
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index ab6953d295df..755540215570 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1046,17 +1046,21 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
 	if (!charged)
 		return -ENOMEM;
 
-	/* pre-charge to forward_alloc */
-	sk_memory_allocated_add(sk, pages);
-	allocated = sk_memory_allocated(sk);
-	/* If the system goes into memory pressure with this
-	 * precharge, give up and return error.
-	 */
-	if (allocated > sk_prot_mem_limits(sk, 1)) {
-		sk_memory_allocated_sub(sk, pages);
-		mem_cgroup_sk_uncharge(sk, pages);
-		return -ENOMEM;
+	if (!mem_cgroup_sk_isolated(sk)) {
+		/* pre-charge to forward_alloc */
+		sk_memory_allocated_add(sk, pages);
+		allocated = sk_memory_allocated(sk);
+
+		/* If the system goes into memory pressure with this
+		 * precharge, give up and return error.
+		 */
+		if (allocated > sk_prot_mem_limits(sk, 1)) {
+			sk_memory_allocated_sub(sk, pages);
+			mem_cgroup_sk_uncharge(sk, pages);
+			return -ENOMEM;
+		}
 	}
+
 	sk_forward_alloc_add(sk, pages << PAGE_SHIFT);
 
 	WRITE_ONCE(sk->sk_reserved_mem,
@@ -3153,8 +3157,11 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 	if (likely(skb_page_frag_refill(32U, pfrag, sk->sk_allocation)))
 		return true;
 
-	sk_enter_memory_pressure(sk);
+	if (sk_should_enter_memory_pressure(sk))
+		sk_enter_memory_pressure(sk);
+
 	sk_stream_moderate_sndbuf(sk);
+
 	return false;
 }
 EXPORT_SYMBOL(sk_page_frag_refill);
@@ -3267,18 +3274,30 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
 	bool memcg_enabled = false, charged = false;
 	struct proto *prot = sk->sk_prot;
-	long allocated;
-
-	sk_memory_allocated_add(sk, amt);
-	allocated = sk_memory_allocated(sk);
+	long allocated = 0;
 
 	if (mem_cgroup_sk_enabled(sk)) {
+		bool isolated = mem_cgroup_sk_isolated(sk);
+
 		memcg_enabled = true;
 		charged = mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge());
-		if (!charged)
+
+		if (isolated && charged)
+			return 1;
+
+		if (!charged) {
+			if (!isolated) {
+				sk_memory_allocated_add(sk, amt);
+				allocated = sk_memory_allocated(sk);
+			}
+
 			goto suppress_allocation;
+		}
 	}
 
+	sk_memory_allocated_add(sk, amt);
+	allocated = sk_memory_allocated(sk);
+
 	/* Under limit. */
 	if (allocated <= sk_prot_mem_limits(sk, 0)) {
 		sk_leave_memory_pressure(sk);
@@ -3357,7 +3376,8 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 
 	trace_sock_exceed_buf_limit(sk, prot, allocated, kind);
 
-	sk_memory_allocated_sub(sk, amt);
+	if (allocated)
+		sk_memory_allocated_sub(sk, amt);
 
 	if (charged)
 		mem_cgroup_sk_uncharge(sk, amt);
@@ -3396,11 +3416,15 @@ EXPORT_SYMBOL(__sk_mem_schedule);
  */
 void __sk_mem_reduce_allocated(struct sock *sk, int amount)
 {
-	sk_memory_allocated_sub(sk, amount);
-
-	if (mem_cgroup_sk_enabled(sk))
+	if (mem_cgroup_sk_enabled(sk)) {
 		mem_cgroup_sk_uncharge(sk, amount);
 
+		if (mem_cgroup_sk_isolated(sk))
+			return;
+	}
+
+	sk_memory_allocated_sub(sk, amount);
+
 	if (sk_under_global_memory_pressure(sk) &&
 	    (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
 		sk_leave_memory_pressure(sk);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 0ef1eacd539d..9d56085f7f54 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -22,6 +22,7 @@
 #include <net/tcp.h>
 #include <net/sock_reuseport.h>
 #include <net/addrconf.h>
+#include <net/proto_memory.h>
 
 #if IS_ENABLED(CONFIG_IPV6)
 /* match_sk*_wildcard == true:  IPV6_ADDR_ANY equals to any IPv6 addresses
@@ -710,7 +711,6 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
 
 	if (mem_cgroup_sockets_enabled) {
 		gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
-		int amt = 0;
 
 		/* atomically get the memory usage, set and charge the
 		 * newsk->sk_memcg.
@@ -719,15 +719,27 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
 
 		mem_cgroup_sk_alloc(newsk);
 		if (mem_cgroup_from_sk(newsk)) {
+			int amt;
+
 			/* The socket has not been accepted yet, no need
 			 * to look at newsk->sk_wmem_queued.
 			 */
 			amt = sk_mem_pages(newsk->sk_forward_alloc +
 					   atomic_read(&newsk->sk_rmem_alloc));
+			if (amt) {
+				/* This amt is already charged globally to
+				 * sk_prot->memory_allocated due to lack of
+				 * sk_memcg until accept(), thus we need to
+				 * reclaim it here if newsk is isolated.
+				 */
+				if (mem_cgroup_sk_isolated(newsk))
+					sk_memory_allocated_sub(newsk, amt);
+
+				mem_cgroup_sk_charge(newsk, amt, gfp);
+			}
+
 		}
 
-		if (amt)
-			mem_cgroup_sk_charge(newsk, amt, gfp);
 		kmem_cache_charge(newsk, gfp);
 
 		release_sock(newsk);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 71a956fbfc55..dcbd49e2f8af 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -908,7 +908,8 @@ struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
 		}
 		__kfree_skb(skb);
 	} else {
-		sk->sk_prot->enter_memory_pressure(sk);
+		if (sk_should_enter_memory_pressure(sk))
+			tcp_enter_memory_pressure(sk);
 		sk_stream_moderate_sndbuf(sk);
 	}
 	return NULL;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index dfbac0876d96..f7aa86661219 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3574,12 +3574,18 @@ void sk_forced_mem_schedule(struct sock *sk, int size)
 	delta = size - sk->sk_forward_alloc;
 	if (delta <= 0)
 		return;
+
 	amt = sk_mem_pages(delta);
 	sk_forward_alloc_add(sk, amt << PAGE_SHIFT);
-	sk_memory_allocated_add(sk, amt);
 
-	if (mem_cgroup_sk_enabled(sk))
+	if (mem_cgroup_sk_enabled(sk)) {
 		mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge() | __GFP_NOFAIL);
+
+		if (mem_cgroup_sk_isolated(sk))
+			return;
+	}
+
+	sk_memory_allocated_add(sk, amt);
 }
 
 /* Send a FIN. The caller locks the socket for us.
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 9a287b75c1b3..1a4089b05a16 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -16,6 +16,7 @@
 #include <net/inet_common.h>
 #include <net/inet_hashtables.h>
 #include <net/protocol.h>
+#include <net/proto_memory.h>
 #include <net/tcp_states.h>
 #if IS_ENABLED(CONFIG_MPTCP_IPV6)
 #include <net/transp_v6.h>
@@ -1016,8 +1017,9 @@ static void mptcp_enter_memory_pressure(struct sock *sk)
 	mptcp_for_each_subflow(msk, subflow) {
 		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
 
-		if (first)
+		if (first && sk_should_enter_memory_pressure(sk))
 			tcp_enter_memory_pressure(ssk);
+
 		sk_stream_moderate_sndbuf(ssk);
 
 		first = false;
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index f672a62a9a52..6696ef837116 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -35,6 +35,7 @@
 #include <linux/netdevice.h>
 #include <net/dst.h>
 #include <net/inet_connection_sock.h>
+#include <net/proto_memory.h>
 #include <net/tcp.h>
 #include <net/tls.h>
 #include <linux/skbuff_ref.h>
@@ -371,7 +372,8 @@ static int tls_do_allocation(struct sock *sk,
 	if (!offload_ctx->open_record) {
 		if (unlikely(!skb_page_frag_refill(prepend_size, pfrag,
 						   sk->sk_allocation))) {
-			READ_ONCE(sk->sk_prot)->enter_memory_pressure(sk);
+			if (sk_should_enter_memory_pressure(sk))
+				READ_ONCE(sk->sk_prot)->enter_memory_pressure(sk);
 			sk_stream_moderate_sndbuf(sk);
 			return -ENOMEM;
 		}
-- 
2.51.0.rc0.205.g4a044479a3-goog



* Re: [PATCH v3 net-next 06/12] net-memcg: Introduce mem_cgroup_from_sk().
  2025-08-12 17:58 ` [PATCH v3 net-next 06/12] net-memcg: Introduce mem_cgroup_from_sk() Kuniyuki Iwashima
@ 2025-08-13  1:44   ` Roman Gushchin
  0 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-08-13  1:44 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Andrew Morton,
	Michal Koutný, Tejun Heo, Simon Horman, Geliang Tang,
	Muchun Song, Mina Almasry, Kuniyuki Iwashima, netdev, mptcp,
	cgroups, linux-mm

Kuniyuki Iwashima <kuniyu@google.com> writes:

> We will store a flag in the lowest bit of sk->sk_memcg.
>
> Then, directly dereferencing sk->sk_memcg will be illegal, and we
> do not want to allow touching the raw sk->sk_memcg in many places.
>
> Let's introduce mem_cgroup_from_sk().
>
> Other places accessing the raw sk->sk_memcg will be converted later.
>
> Note that we cannot define the helper as an inline function in
> memcontrol.h as we cannot access any fields of struct sock there
> due to circular dependency, so it is placed in sock.h.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> Reviewed-by: Eric Dumazet <edumazet@google.com>

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>


* Re: [PATCH v3 net-next 07/12] net-memcg: Introduce mem_cgroup_sk_enabled().
  2025-08-12 17:58 ` [PATCH v3 net-next 07/12] net-memcg: Introduce mem_cgroup_sk_enabled() Kuniyuki Iwashima
@ 2025-08-13  1:46   ` Roman Gushchin
  0 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-08-13  1:46 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Andrew Morton,
	Michal Koutný, Tejun Heo, Simon Horman, Geliang Tang,
	Muchun Song, Mina Almasry, Kuniyuki Iwashima, netdev, mptcp,
	cgroups, linux-mm

Kuniyuki Iwashima <kuniyu@google.com> writes:

> The socket memcg feature is enabled by a static key and
> only works for non-root cgroup.
>
> We check both conditions in many places.
>
> Let's factorise it as a helper function.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> Reviewed-by: Eric Dumazet <edumazet@google.com>

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>


* Re: [PATCH v3 net-next 08/12] net-memcg: Pass struct sock to mem_cgroup_sk_(un)?charge().
  2025-08-12 17:58 ` [PATCH v3 net-next 08/12] net-memcg: Pass struct sock to mem_cgroup_sk_(un)?charge() Kuniyuki Iwashima
@ 2025-08-13  1:47   ` Roman Gushchin
  0 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-08-13  1:47 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Andrew Morton,
	Michal Koutný, Tejun Heo, Simon Horman, Geliang Tang,
	Muchun Song, Mina Almasry, Kuniyuki Iwashima, netdev, mptcp,
	cgroups, linux-mm

Kuniyuki Iwashima <kuniyu@google.com> writes:

> We will store a flag in the lowest bit of sk->sk_memcg.
>
> Then, we cannot pass the raw pointer to mem_cgroup_charge_skmem()
> and mem_cgroup_uncharge_skmem().
>
> Let's pass struct sock to the functions.
>
> While at it, they are renamed to match other functions starting
> with mem_cgroup_sk_.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> Reviewed-by: Eric Dumazet <edumazet@google.com>

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>


* Re: [PATCH v3 net-next 09/12] net-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure().
  2025-08-12 17:58 ` [PATCH v3 net-next 09/12] net-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure() Kuniyuki Iwashima
@ 2025-08-13  1:49   ` Roman Gushchin
  2025-08-13  1:49   ` Roman Gushchin
  1 sibling, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-08-13  1:49 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Andrew Morton,
	Michal Koutný, Tejun Heo, Simon Horman, Geliang Tang,
	Muchun Song, Mina Almasry, Kuniyuki Iwashima, netdev, mptcp,
	cgroups, linux-mm

Kuniyuki Iwashima <kuniyu@google.com> writes:

> We will store a flag in the lowest bit of sk->sk_memcg.
>
> Then, we cannot pass the raw pointer to mem_cgroup_under_socket_pressure().
>
> Let's pass struct sock to it and rename the function to match other
> functions starting with mem_cgroup_sk_.
>
> Note that the helper is moved to sock.h to use mem_cgroup_from_sk().
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> Reviewed-by: Eric Dumazet <edumazet@google.com>

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>


* Re: [PATCH v3 net-next 09/12] net-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure().
  2025-08-12 17:58 ` [PATCH v3 net-next 09/12] net-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure() Kuniyuki Iwashima
  2025-08-13  1:49   ` Roman Gushchin
@ 2025-08-13  1:49   ` Roman Gushchin
  1 sibling, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2025-08-13  1:49 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Andrew Morton,
	Michal Koutný, Tejun Heo, Simon Horman, Geliang Tang,
	Muchun Song, Mina Almasry, Kuniyuki Iwashima, netdev, mptcp,
	cgroups, linux-mm

Kuniyuki Iwashima <kuniyu@google.com> writes:

> We will store a flag in the lowest bit of sk->sk_memcg.
>
> Then, we cannot pass the raw pointer to mem_cgroup_under_socket_pressure().
>
> Let's pass struct sock to it and rename the function to match other
> functions starting with mem_cgroup_sk_.
>
> Note that the helper is moved to sock.h to use mem_cgroup_from_sk().
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> Reviewed-by: Eric Dumazet <edumazet@google.com>

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>


* Re: [PATCH v3 net-next 12/12] net-memcg: Decouple controlled memcg from global protocol memory accounting.
  2025-08-12 17:58 ` [PATCH v3 net-next 12/12] net-memcg: Decouple controlled memcg from global protocol memory accounting Kuniyuki Iwashima
@ 2025-08-13  1:57   ` Roman Gushchin
  2025-08-13  5:32     ` Kuniyuki Iwashima
  2025-08-13  7:11   ` Shakeel Butt
  2025-08-13 13:00   ` Johannes Weiner
  2 siblings, 1 reply; 33+ messages in thread
From: Roman Gushchin @ 2025-08-13  1:57 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Andrew Morton,
	Michal Koutný, Tejun Heo, Simon Horman, Geliang Tang,
	Muchun Song, Mina Almasry, Kuniyuki Iwashima, netdev, mptcp,
	cgroups, linux-mm

Kuniyuki Iwashima <kuniyu@google.com> writes:

> Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> buffers and charge memory to per-protocol global counters pointed to by
> sk->sk_proto->memory_allocated.
>
> When running under a non-root cgroup, this memory is also charged to the
> memcg as "sock" in memory.stat.
>
> Even when a memcg controls memory usage, sockets of such protocols are
> still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
>
> This makes it difficult to accurately estimate and configure appropriate
> global limits, especially in multi-tenant environments.
>
> If all workloads were guaranteed to be controlled under memcg, the issue
> could be worked around by setting tcp_mem[0~2] to UINT_MAX.
>
> In reality, this assumption does not always hold, and processes that
> belong to the root cgroup or opt out of memcg can consume memory up to
> the global limit, becoming a noisy neighbour.
>
> Let's decouple memcg from the global per-protocol memory accounting if
> it has a finite memory.max (!= "max").

I think you can't make the new behavior the new default, simply because
it might break existing setups. Basically, anyone who is using memory.max
will suddenly have their processes opted out of the net memory
accounting. Idk how many users are actually relying on the network
memory accounting, but I believe way more than 0.

So I guess a net sysctl/some other knob is warranted here, with the old
behavior being the default.

Thanks


* Re: [PATCH v3 net-next 12/12] net-memcg: Decouple controlled memcg from global protocol memory accounting.
  2025-08-13  1:57   ` Roman Gushchin
@ 2025-08-13  5:32     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-13  5:32 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Andrew Morton,
	Michal Koutný, Tejun Heo, Simon Horman, Geliang Tang,
	Muchun Song, Mina Almasry, Kuniyuki Iwashima, netdev, mptcp,
	cgroups, linux-mm

On Tue, Aug 12, 2025 at 6:58 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Kuniyuki Iwashima <kuniyu@google.com> writes:
>
> > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > buffers and charge memory to per-protocol global counters pointed to by
> > sk->sk_proto->memory_allocated.
> >
> > When running under a non-root cgroup, this memory is also charged to the
> > memcg as "sock" in memory.stat.
> >
> > Even when a memcg controls memory usage, sockets of such protocols are
> > still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> >
> > This makes it difficult to accurately estimate and configure appropriate
> > global limits, especially in multi-tenant environments.
> >
> > If all workloads were guaranteed to be controlled under memcg, the issue
> > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> >
> > In reality, this assumption does not always hold, and processes that
> > belong to the root cgroup or opt out of memcg can consume memory up to
> > the global limit, becoming a noisy neighbour.
> >
> > Let's decouple memcg from the global per-protocol memory accounting if
> > it has a finite memory.max (!= "max").
>
> I think you can't make the new behavior the new default, simply because
> it might break existing setups. Basically, anyone who is using memory.max
> will suddenly have their processes opted out of the net memory
> accounting. Idk how many users are actually relying on the network
> memory accounting, but I believe way more than 0.
>
> So I guess a net sysctl/some other knob is warranted here, with the old
> behavior being the default.

I don't think we need a knob to change the behaviour, because
any affected case must be relying on a broken assumption.

There are 3 possible cases below.

1) memory.max == "max"

2) memory.max != "max" and tcp_mem does not suppress
   memory allocation

3) memory.max != "max" and tcp_mem suppresses memory
   allocation

1) is not affected, and 2) is not affected either, because decoupling
does not change the situation.

Then, for 3), this change will allow more memory usage than before,
but it's still limited by memory.max, which is configured by
the sysadmin.

For this to be a problem, the sum of all memcgs'
memory.max would have to exceed the amount of system memory,
which is unlikely if configured properly.

Also, in case 3), TCP has quite bad performance, and the
sysadmin should have raised the tcp_mem limit and moved
to 2), like our unlimited tcp_mem setting.
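
For concreteness, a case 2) setup might look like this (illustrative
values, not taken from this thread): a finite memory.max plus tcp_mem
raised high enough, in pages, that the global limit never kicks in
before memory.max does.

  # echo $((4 * 1024 ** 3)) > /sys/fs/cgroup/test/memory.max
  # sysctl -w net.ipv4.tcp_mem="4194304 4194304 4194304"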

What do you think ?


* Re: [PATCH v3 net-next 12/12] net-memcg: Decouple controlled memcg from global protocol memory accounting.
  2025-08-12 17:58 ` [PATCH v3 net-next 12/12] net-memcg: Decouple controlled memcg from global protocol memory accounting Kuniyuki Iwashima
  2025-08-13  1:57   ` Roman Gushchin
@ 2025-08-13  7:11   ` Shakeel Butt
  2025-08-13 18:19     ` Kuniyuki Iwashima
  2025-08-13 13:00   ` Johannes Weiner
  2 siblings, 1 reply; 33+ messages in thread
From: Shakeel Butt @ 2025-08-13  7:11 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Andrew Morton,
	Michal Koutný, Tejun Heo, Simon Horman, Geliang Tang,
	Muchun Song, Mina Almasry, Kuniyuki Iwashima, netdev, mptcp,
	cgroups, linux-mm

On Tue, Aug 12, 2025 at 05:58:30PM +0000, Kuniyuki Iwashima wrote:
> Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> buffers and charge memory to per-protocol global counters pointed to by
> sk->sk_proto->memory_allocated.
> 
> When running under a non-root cgroup, this memory is also charged to the
> memcg as "sock" in memory.stat.
> 
> Even when a memcg controls memory usage, sockets of such protocols are
> still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> 
> This makes it difficult to accurately estimate and configure appropriate
> global limits, especially in multi-tenant environments.
> 
> If all workloads were guaranteed to be controlled under memcg, the issue
> could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> 
> In reality, this assumption does not always hold, and processes that
> belong to the root cgroup or opt out of memcg can consume memory up to
> the global limit, becoming a noisy neighbour.

Having processes run in the root memcg (I am not sure what 'opt out of
memcg' means) means the admin has intentionally allowed scenarios where
a noisy neighbour situation can happen, so I am not really following your
argument here.

> 
> Let's decouple memcg from the global per-protocol memory accounting if
> it has a finite memory.max (!= "max").

Why decouple only for some? (Also if you really want to check memcg
limits, you need to check limits for all ancestors and not just the
given memcg).

Why not start with just two global options (maybe start with boot
parameter)?

Option 1: Existing behavior where memcg and global TCP accounting are
coupled.

Option 2: Completely decouple memcg and global TCP accounting i.e. use
mem_cgroup_sockets_enabled to either do global TCP accounting or memcg
accounting.

Keep the option 1 default.

I assume you want third option where a mix of these options can happen
i.e. some sockets are only accounted to a memcg and some are accounted
to both memcg and global TCP. I would recommend to make that a followup
patch series. Keep this series simple and non-controversial.



* Re: [PATCH v3 net-next 01/12] mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n.
  2025-08-12 17:58 ` [PATCH v3 net-next 01/12] mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n Kuniyuki Iwashima
@ 2025-08-13  8:54   ` Matthieu Baerts
  2025-08-14 12:30   ` Michal Koutný
  1 sibling, 0 replies; 33+ messages in thread
From: Matthieu Baerts @ 2025-08-13  8:54 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Neal Cardwell, Paolo Abeni, Willem de Bruijn, Mat Martineau,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Andrew Morton, Michal Koutný, Tejun Heo
  Cc: Simon Horman, Geliang Tang, Muchun Song, Mina Almasry,
	Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm

Hi Kuniyuki,

On 12/08/2025 19:58, Kuniyuki Iwashima wrote:
> When sk_alloc() allocates a socket, mem_cgroup_sk_alloc() sets
> sk->sk_memcg based on the current task.
> 
> MPTCP subflow socket creation is triggered from userspace or
> an in-kernel worker.
> 
> In the latter case, sk->sk_memcg is not what we want.  So, we fix
> it up from the parent socket's sk->sk_memcg in mptcp_attach_cgroup().
> 
> Although the code is placed under #ifdef CONFIG_MEMCG, it is buried
> under #ifdef CONFIG_SOCK_CGROUP_DATA.
> 
> The two configs are orthogonal.  If CONFIG_MEMCG is enabled without
> CONFIG_SOCK_CGROUP_DATA, the subflow's memory usage is not charged
> correctly.
> 
> Let's move the code out of the wrong ifdef guard.
> 
> Note that sk->sk_memcg is freed in sk_prot_free() and the parent
> sk holds the refcnt of memcg->css here, so we don't need to use
> css_tryget().

Thank you for the patch!

Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.



* Re: [PATCH v3 net-next 02/12] mptcp: Use tcp_under_memory_pressure() in mptcp_epollin_ready().
  2025-08-12 17:58 ` [PATCH v3 net-next 02/12] mptcp: Use tcp_under_memory_pressure() in mptcp_epollin_ready() Kuniyuki Iwashima
@ 2025-08-13  8:54   ` Matthieu Baerts
  0 siblings, 0 replies; 33+ messages in thread
From: Matthieu Baerts @ 2025-08-13  8:54 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Neal Cardwell, Paolo Abeni, Willem de Bruijn, Mat Martineau,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Andrew Morton, Michal Koutný, Tejun Heo
  Cc: Simon Horman, Geliang Tang, Muchun Song, Mina Almasry,
	Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm

Hi Kuniyuki,

On 12/08/2025 19:58, Kuniyuki Iwashima wrote:
> Some conditions used in mptcp_epollin_ready() are the same as
> tcp_under_memory_pressure().
> 
> We will modify tcp_under_memory_pressure() in the later patch.
> 
> Let's use tcp_under_memory_pressure() instead.

Good idea, thanks!

Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.



* Re: [PATCH v3 net-next 12/12] net-memcg: Decouple controlled memcg from global protocol memory accounting.
  2025-08-12 17:58 ` [PATCH v3 net-next 12/12] net-memcg: Decouple controlled memcg from global protocol memory accounting Kuniyuki Iwashima
  2025-08-13  1:57   ` Roman Gushchin
  2025-08-13  7:11   ` Shakeel Butt
@ 2025-08-13 13:00   ` Johannes Weiner
  2025-08-13 18:43     ` Kuniyuki Iwashima
  2 siblings, 1 reply; 33+ messages in thread
From: Johannes Weiner @ 2025-08-13 13:00 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton,
	Michal Koutný, Tejun Heo, Simon Horman, Geliang Tang,
	Muchun Song, Mina Almasry, Kuniyuki Iwashima, netdev, mptcp,
	cgroups, linux-mm

On Tue, Aug 12, 2025 at 05:58:30PM +0000, Kuniyuki Iwashima wrote:
> If all workloads were guaranteed to be controlled under memcg, the issue
> could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> 
> In reality, this assumption does not always hold, and processes that
> belong to the root cgroup or opt out of memcg can consume memory up to
> the global limit, becoming a noisy neighbour.

As per the last thread, this is not a supported use case. Opting out of
memcg coverage for individual cgroups is a self-inflicted problem and
a misconfiguration. There is *no* memory isolation *at all* on such
containers. Maybe their socket buffers are the only thing that happens
to matter to *you*, but this is in no way a generic, universal,
upstreamable solution. A knob or auto-detection is not the issue.

Nacked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH v3 net-next 12/12] net-memcg: Decouple controlled memcg from global protocol memory accounting.
  2025-08-13  7:11   ` Shakeel Butt
@ 2025-08-13 18:19     ` Kuniyuki Iwashima
  2025-08-13 20:53       ` Shakeel Butt
  0 siblings, 1 reply; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-13 18:19 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Andrew Morton,
	Michal Koutný, Tejun Heo, Simon Horman, Geliang Tang,
	Muchun Song, Mina Almasry, Kuniyuki Iwashima, netdev, mptcp,
	cgroups, linux-mm

On Wed, Aug 13, 2025 at 12:11 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Aug 12, 2025 at 05:58:30PM +0000, Kuniyuki Iwashima wrote:
> > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > buffers and charge memory to per-protocol global counters pointed to by
> > sk->sk_proto->memory_allocated.
> >
> > When running under a non-root cgroup, this memory is also charged to the
> > memcg as "sock" in memory.stat.
> >
> > Even when a memcg controls memory usage, sockets of such protocols are
> > still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> >
> > This makes it difficult to accurately estimate and configure appropriate
> > global limits, especially in multi-tenant environments.
> >
> > If all workloads were guaranteed to be controlled under memcg, the issue
> > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> >
> > In reality, this assumption does not always hold, and processes that
> > belong to the root cgroup or opt out of memcg can consume memory up to
> > the global limit, becoming a noisy neighbour.
>
> Having processes run in the root memcg (I am not sure what 'opt out of
> memcg' means)

Sorry, I should've clarified: by opt-out I meant memory.max==max
(and the same up through all ancestors, as you pointed out below),
where memcg works but has no effect.


> means the admin has intentionally allowed scenarios where

Not really intentionally, but rather reluctantly, because the admin
cannot enforce memory.max alone without setting tcp_mem=UINT_MAX.
We should not disregard the root cause: the two memory accounting
mechanisms are currently coupled.


> a noisy neighbour situation can happen, so I am not really following your
> argument here.

So basically, here I meant that with tcp_mem=UINT_MAX any process
can become a noisy neighbour unnecessarily.


>
> >
> > Let's decouple memcg from the global per-protocol memory accounting if
> > it has a finite memory.max (!= "max").
>
> Why decouple only for some? (Also if you really want to check memcg
> limits, you need to check limits for all ancestors and not just the
> given memcg).

Oh, I assumed memory.max would be inherited by descendants.


>
> Why not start with just two global options (maybe start with boot
> parameter)?
>
> Option 1: Existing behavior where memcg and global TCP accounting are
> coupled.
>
> Option 2: Completely decouple memcg and global TCP accounting i.e. use
> mem_cgroup_sockets_enabled to either do global TCP accounting or memcg
> accounting.
>
> Keep the option 1 default.
>
> I assume you want third option where a mix of these options can happen
> i.e. some sockets are only accounted to a memcg and some are accounted
> to both memcg and global TCP.

Yes, because usually not all memcgs have memory.max configured,
and we do not want to allow unlimited TCP memory for them.

Option 2 works for processes in the root cgroup but doesn't for
processes in a non-root cgroup with memory.max == max.

A good example is system processes managed by systemd, where
we do not want to specify memory.max but still want a global seatbelt.

Note this is how it works _now_, and we want to _preserve_ that case.
Does this answer "why decouple only for some"?


> I would recommend to make that a followup
> patch series. Keep this series simple and non-controversial.

I can split the series, but I'd like to confirm whether Option 2
is a must for you, i.e. has Meta configured memory.max
for all cgroups?  I didn't think it likely, but if there's a real
use case, I'm happy to add a boot param.

The only diff would be the boot param addition and the condition
change in patch 11, so the simplicity won't change.


* Re: [PATCH v3 net-next 12/12] net-memcg: Decouple controlled memcg from global protocol memory accounting.
  2025-08-13 13:00   ` Johannes Weiner
@ 2025-08-13 18:43     ` Kuniyuki Iwashima
  2025-08-13 20:21       ` Johannes Weiner
  0 siblings, 1 reply; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-13 18:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton,
	Michal Koutný, Tejun Heo, Simon Horman, Geliang Tang,
	Muchun Song, Mina Almasry, Kuniyuki Iwashima, netdev, mptcp,
	cgroups, linux-mm

On Wed, Aug 13, 2025 at 6:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Aug 12, 2025 at 05:58:30PM +0000, Kuniyuki Iwashima wrote:
> > If all workloads were guaranteed to be controlled under memcg, the issue
> > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> >
> > In reality, this assumption does not always hold, and processes that
> > belong to the root cgroup or opt out of memcg can consume memory up to
> > the global limit, becoming a noisy neighbour.
>
> As per the last thread, this is not a supported use case. Opting out of
> memcg coverage for individual cgroups is a self-inflicted problem and
> a misconfiguration. There is *no* memory isolation *at all* on such
> containers.

I think the commit message needs to be improved, but could
you read through the patch again?  I think you have the
same misunderstanding that Shakeel had and later corrected here:
https://lore.kernel.org/netdev/jmbszz4m7xkw7fzolpusjesbreaczmr4i64kynbs3zcoehrkpj@lwso5soc4dh3/

---8<---
Initially, I thought the series introduced multiple modes, including an
option to exclude network memory from memcg accounting. However, if I
understand correctly, that is not the case—the opt-out applies only to
the global TCP/UDP accounting. That’s a relief, and I apologize for the
misunderstanding.
---8<---

This patch does NOT change how memcg is applied to sockets
but changes how _another_ memory accounting mechanism in the
networking layer is applied to sockets.

Currently, memcg AND the other memory accounting are both applied
to socket buffers.

With or without this patch, memcg is _always_ applied to socket
buffers.

Also, there is _no_ behavioural change for _uncontrolled
containers_ that have been subject to both memory accounting
mechanisms.  This behaviour hasn't changed since you added
memcg support for the networking stack in e805605c72102,
and we want to _preserve_ it.

This change stops the double charging by opting out of _the
networking-layer one_, because it interferes with memcg
and complicates the configuration of memory.max and the
global networking limit.


> Maybe their socket buffers are the only thing that happens
> to matter to *you*, but this is in no way a generic, universal,
> upstreamable solution. A knob or auto-detection is not the issue.
>
> Nacked-by: Johannes Weiner <hannes@cmpxchg.org>

Please let me know if this nack still applies with the
explanation above.


* Re: [PATCH v3 net-next 12/12] net-memcg: Decouple controlled memcg from global protocol memory accounting.
  2025-08-13 18:43     ` Kuniyuki Iwashima
@ 2025-08-13 20:21       ` Johannes Weiner
  0 siblings, 0 replies; 33+ messages in thread
From: Johannes Weiner @ 2025-08-13 20:21 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton,
	Michal Koutný, Tejun Heo, Simon Horman, Geliang Tang,
	Muchun Song, Mina Almasry, Kuniyuki Iwashima, netdev, mptcp,
	cgroups, linux-mm

On Wed, Aug 13, 2025 at 11:43:15AM -0700, Kuniyuki Iwashima wrote:
> On Wed, Aug 13, 2025 at 6:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> This change stops the double charging by opting out of _the
> networking-layer one_, because it interferes with memcg
> and complicates the configuration of memory.max and the
> global networking limit.

No, we do want the global limits as a backstop - even if every single
cgroup in the system has its own memory limit.

Sure, from a fairness POV, we want socket buffers accounted towards
the containers' memory footprint and subject to their limits.

But that doesn't imply that we can let the cgroup limit be the only
thing curbing an explosion in socket buffers.

This isn't about fairness, but about host stability.

The MM can easily get rid of file cache and heap pages, but it has
limited to no control over the socket buffer lifetime. If you split a
1TB host into 8 containers limited to ~128G, that doesn't mean you
want to allow up to 1TB of memory in socket buffers. That could make
low memory situations unrecoverable.

> > Maybe their socket buffers is the only thing that happens
> > to matter to *you*, but this is in no way a generic, universal,
> > upstreamable solution. Knob or auto-detection is not the issue.
> >
> > Nacked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Please let me know if this nack still applies with the
> explanation above.

Yes; for one, I think it's an unacceptable behavioral change to the
sysctl semantics.

But my wider point is that I think you're trying to fix something that
is a direct result of a flawed approach to containerization, and it
would make much more sense to address that instead.


* Re: [PATCH v3 net-next 12/12] net-memcg: Decouple controlled memcg from global protocol memory accounting.
  2025-08-13 18:19     ` Kuniyuki Iwashima
@ 2025-08-13 20:53       ` Shakeel Butt
  2025-08-14  0:54         ` Martin KaFai Lau
  0 siblings, 1 reply; 33+ messages in thread
From: Shakeel Butt @ 2025-08-13 20:53 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Andrew Morton,
	Michal Koutný, Tejun Heo, Simon Horman, Geliang Tang,
	Muchun Song, Mina Almasry, Kuniyuki Iwashima, netdev, mptcp,
	cgroups, linux-mm, bpf

On Wed, Aug 13, 2025 at 11:19:31AM -0700, Kuniyuki Iwashima wrote:
> On Wed, Aug 13, 2025 at 12:11 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Tue, Aug 12, 2025 at 05:58:30PM +0000, Kuniyuki Iwashima wrote:
> > > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > > buffers and charge memory to per-protocol global counters pointed to by
> > > sk->sk_proto->memory_allocated.
> > >
> > > When running under a non-root cgroup, this memory is also charged to the
> > > memcg as "sock" in memory.stat.
> > >
> > > Even when a memcg controls memory usage, sockets of such protocols are
> > > still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> > >
> > > This makes it difficult to accurately estimate and configure appropriate
> > > global limits, especially in multi-tenant environments.
> > >
> > > If all workloads were guaranteed to be controlled under memcg, the issue
> > > could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> > >
> > > In reality, this assumption does not always hold, and processes that
> > > belong to the root cgroup or opt out of memcg can consume memory up to
> > > the global limit, becoming a noisy neighbour.
> >
> > Having processes run in the root memcg (I am not sure what 'opt out of
> > memcg' means)
>
> Sorry, I should've clarified: by opt-out I meant memory.max==max
> (and the same up through all ancestors, as you pointed out below),
> where memcg works but has no effect.
>
>
> > means the admin has intentionally allowed scenarios where
>
> Not really intentionally, but rather reluctantly, because the admin
> cannot enforce memory.max alone without setting tcp_mem=UINT_MAX.
> We should not disregard the root cause: the two memory accounting
> mechanisms are currently coupled.
>
>
> > a noisy neighbour situation can happen, so I am not really following your
> > argument here.
>
> So basically, here I meant that with tcp_mem=UINT_MAX any process
> can become a noisy neighbour unnecessarily.

Only if there are processes in cgroups with unlimited memory limits.

I think you are still missing the point. So, let me be very clear:

Please stop using the "processes in a cgroup with memory.max==max can be
a source of isolation issues" argument. Having an unlimited limit means you
don't want isolation. More importantly, you don't really need this
argument for your work. It is clear (to me at least) that we want global
TCP memory accounting to be decoupled from memcg accounting. Using the
flawed argument is just making your series controversial.

[...]
> > Why not start with just two global options (maybe start with boot
> > parameter)?
> >
> > Option 1: Existing behavior where memcg and global TCP accounting are
> > coupled.
> >
> > Option 2: Completely decouple memcg and global TCP accounting i.e. use
> > mem_cgroup_sockets_enabled to either do global TCP accounting or memcg
> > accounting.
> >
> > Keep the option 1 default.
> >
> > I assume you want third option where a mix of these options can happen
> > i.e. some sockets are only accounted to a memcg and some are accounted
> > to both memcg and global TCP.
> 
> Yes, because usually not all memcgs have memory.max configured,
> and we do not want to allow unlimited TCP memory for them.
>
> Option 2 works for processes in the root cgroup but doesn't for
> processes in a non-root cgroup with memory.max == max.
>
> A good example is system processes managed by systemd, where
> we do not want to specify memory.max but still want a global seatbelt.
>
> Note this is how it works _now_, and we want to _preserve_ that case.
> Does this answer "why decouple only for some"?
> 

I hope I have made it very clear that you should stop using the memory.max == max argument.

> 
> > I would recommend to make that a followup
> > patch series. Keep this series simple and non-controversial.
> 
> I can split the series, but I'd like to confirm whether Option 2
> is a must for you, i.e. has Meta configured memory.max
> for all cgroups?  I didn't think it likely, but if there's a real
> use case, I'm happy to add a boot param.
>
> The only diff would be the boot param addition and the condition
> change in patch 11, so the simplicity won't change.

I am not sure if option 2 will be used by Meta or anyone else, so no
objection from me to not pursuing it. However, I don't want some
userspace policy to opt in to one or the other accounting mechanism in
the kernel.

What I think is the right approach is a BPF struct_ops based one, with a
callback like 'is this socket under pressure' or maybe 'is this socket
isolated', and then you can do whatever you want in those callbacks. In
this way you can follow the same approach of caching the result in the
kernel (lower bits of sk->sk_memcg).

I am CCing bpf list to get some suggestions or concerns on this
approach.
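
To make the idea concrete, here is a purely hypothetical sketch of such
a struct_ops interface; neither this struct nor its wiring exists in
today's kernel, and the callback names are invented for illustration:

	/* Hypothetical sketch only -- nothing below exists upstream. */
	struct memcg_sock_ops {
		/* Return true if @sk should bypass the global
		 * per-protocol counters and be charged to its memcg
		 * only.
		 */
		bool (*sk_isolated)(struct sock *sk);

		/* Return true if @sk should be treated as under memory
		 * pressure, independent of the global protocol state.
		 */
		bool (*sk_under_pressure)(struct sock *sk);
	};

The kernel would consult these once per socket and cache the answers in
the lower bits of sk->sk_memcg, as described above.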



* Re: [PATCH v3 net-next 12/12] net-memcg: Decouple controlled memcg from global protocol memory accounting.
  2025-08-13 20:53       ` Shakeel Butt
@ 2025-08-14  0:54         ` Martin KaFai Lau
  2025-08-14  4:34           ` Kuniyuki Iwashima
  0 siblings, 1 reply; 33+ messages in thread
From: Martin KaFai Lau @ 2025-08-14  0:54 UTC (permalink / raw)
  To: Shakeel Butt, Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Andrew Morton,
	Michal Koutný, Tejun Heo, Simon Horman, Geliang Tang,
	Muchun Song, Mina Almasry, Kuniyuki Iwashima, netdev, mptcp,
	cgroups, linux-mm, bpf

On 8/13/25 1:53 PM, Shakeel Butt wrote:
> What I think is the right approach is a BPF struct_ops based one, with a
> callback like 'is this socket under pressure' or maybe 'is this socket
> isolated', and then you can do whatever you want in those callbacks. In
> this way you can follow the same approach of caching the result in the
> kernel (lower bits of sk->sk_memcg).
> 
> I am CCing bpf list to get some suggestions or concerns on this
> approach.

I have quickly looked at the set. In patch 11, it sets a bit in sk->sk_memcg.

On the bpf side, there are already cgroup bpf progs that can do bpf_setsockopt
on a sk, so the same can be done here. The bpf_setsockopt does not have to set
an option/knob that is available in the uapi, in case we don't want to expose
this to user space.

The cgroup bpf prog (BPF_CGROUP_INET_SOCK_CREATE) can already be run when an
"inet" sock is created. This hook (i.e. attach_type) does not have access to
bpf_setsockopt yet, but that should be easy to add.
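
A minimal sketch of what such a program might look like, assuming the
hook gains bpf_setsockopt() access as suggested above; note that
SK_BPF_MEMCG_ISOLATED is an invented, bpf-only option name used purely
for illustration:

	/* SPDX-License-Identifier: GPL-2.0 */
	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	#define SOL_SOCKET		1
	#define SK_BPF_MEMCG_ISOLATED	1000	/* hypothetical optname */

	SEC("cgroup/sock_create")
	int isolate_sock_memcg(struct bpf_sock *ctx)
	{
		int one = 1;

		/* Assumed semantics: charge this socket to its memcg
		 * only, skipping the global per-protocol counters.
		 */
		bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_ISOLATED,
			       &one, sizeof(one));

		return 1;	/* allow the socket to be created */
	}

	char _license[] SEC("license") = "GPL";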

For a more comprehensive mem charge policy that needs a new bpf hook, that
will probably need struct_ops instead of another cgroup attach_type, but that
is an implementation detail.


* Re: [PATCH v3 net-next 12/12] net-memcg: Decouple controlled memcg from global protocol memory accounting.
  2025-08-14  0:54         ` Martin KaFai Lau
@ 2025-08-14  4:34           ` Kuniyuki Iwashima
  2025-08-14 17:10             ` Shakeel Butt
  0 siblings, 1 reply; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-14  4:34 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Shakeel Butt, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Neal Cardwell, Paolo Abeni, Willem de Bruijn, Matthieu Baerts,
	Mat Martineau, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Andrew Morton, Michal Koutný, Tejun Heo, Simon Horman,
	Geliang Tang, Muchun Song, Mina Almasry, Kuniyuki Iwashima,
	netdev, mptcp, cgroups, linux-mm, bpf

On Wed, Aug 13, 2025 at 5:55 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 8/13/25 1:53 PM, Shakeel Butt wrote:
> > What I think is the right approach is a BPF struct_ops based one, with a
> > callback like 'is this socket under pressure' or maybe 'is this socket
> > isolated', and then you can do whatever you want in those callbacks. In
> > this way you can follow the same approach of caching the result in the
> > kernel (lower bits of sk->sk_memcg).
> >
> > I am CCing bpf list to get some suggestions or concerns on this
> > approach.
>
> I have quickly looked at the set. In patch 11, it sets a bit in sk->sk_memcg.
>
> On the bpf side, there are already cgroup bpf progs that can do bpf_setsockopt
> on a sk, so the same can be done here. The bpf_setsockopt does not have to set
> an option/knob that is available in the uapi, in case we don't want to expose
> this to user space.
>
> The cgroup bpf prog (BPF_CGROUP_INET_SOCK_CREATE) can already be run when an
> "inet" sock is created. This hook (i.e. attach_type) does not have access to
> bpf_setsockopt yet, but that should be easy to add.

Okay, I will try the bpf_setsockopt() approach.
Should I post patches 1-10 to net-next separately?
They are pure net material that gathers the memcg code under CONFIG_MEMCG.


>
> For a more comprehensive mem charge policy that needs a new bpf hook, that
> will probably need struct_ops instead of another cgroup attach_type, but that
> is an implementation detail.


* Re: [PATCH v3 net-next 01/12] mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n.
  2025-08-12 17:58 ` [PATCH v3 net-next 01/12] mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n Kuniyuki Iwashima
  2025-08-13  8:54   ` Matthieu Baerts
@ 2025-08-14 12:30   ` Michal Koutný
  2025-08-14 19:17     ` Kuniyuki Iwashima
  1 sibling, 1 reply; 33+ messages in thread
From: Michal Koutný @ 2025-08-14 12:30 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Neal Cardwell,
	Paolo Abeni, Willem de Bruijn, Matthieu Baerts, Mat Martineau,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Andrew Morton, Tejun Heo, Simon Horman, Geliang Tang, Muchun Song,
	Mina Almasry, Kuniyuki Iwashima, netdev, mptcp, cgroups, linux-mm


Hello.

On Tue, Aug 12, 2025 at 05:58:19PM +0000, Kuniyuki Iwashima <kuniyu@google.com> wrote:
> When sk_alloc() allocates a socket, mem_cgroup_sk_alloc() sets
> sk->sk_memcg based on the current task.
> 
> MPTCP subflow socket creation is triggered from userspace or
> an in-kernel worker.

I somewhat remembered
d752a4986532c ("net: memcg: late association of sock to memcg"),

but IIUC, in this MPTCP codepath the socket would never be visible to
userspace nor manipulated from a proper process context, so there is no
option to defer the association in a similar fashion, correct?

Then, I wonder whether this isn't a scenario for
	o = set_active_memcg(sk->sk_memcg);
	newsk =  sk_alloc();
	...
	set_active_memcg(o);

i.e. utilize the existing remote charging infra instead of introducing
a specific mem_cgroup_sk_inherit() helper.

Regards,
Michal



* Re: [PATCH v3 net-next 12/12] net-memcg: Decouple controlled memcg from global protocol memory accounting.
  2025-08-14  4:34           ` Kuniyuki Iwashima
@ 2025-08-14 17:10             ` Shakeel Butt
  0 siblings, 0 replies; 33+ messages in thread
From: Shakeel Butt @ 2025-08-14 17:10 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Martin KaFai Lau, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Neal Cardwell, Paolo Abeni, Willem de Bruijn, Matthieu Baerts,
	Mat Martineau, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Andrew Morton, Michal Koutný, Tejun Heo, Simon Horman,
	Geliang Tang, Muchun Song, Mina Almasry, Kuniyuki Iwashima,
	netdev, mptcp, cgroups, linux-mm, bpf

On Wed, Aug 13, 2025 at 09:34:01PM -0700, Kuniyuki Iwashima wrote:
> On Wed, Aug 13, 2025 at 5:55 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On 8/13/25 1:53 PM, Shakeel Butt wrote:
> > > What I think is the right approach is a BPF struct_ops based one, with a
> > > callback like 'is this socket under pressure' or maybe 'is this socket
> > > isolated', and then you can do whatever you want in those callbacks. In
> > > this way you can follow the same approach of caching the result in the
> > > kernel (lower bits of sk->sk_memcg).
> > >
> > > I am CCing bpf list to get some suggestions or concerns on this
> > > approach.
> >
> > I have quickly looked at the set. In patch 11, it sets a bit in sk->sk_memcg.
> >
> > On the bpf side, there are already cgroup bpf progs that can do bpf_setsockopt
> > on a sk, so the same can be done here. The bpf_setsockopt does not have to set
> > an option/knob that is available in the uapi, in case we don't want to expose
> > this to user space.
> >
> > The cgroup bpf prog (BPF_CGROUP_INET_SOCK_CREATE) can already be run when an
> > "inet" sock is created. This hook (i.e. attach_type) does not have access to
> > bpf_setsockopt yet, but that should be easy to add.
> 
> Okay, I will try the bpf_setsockopt() approach.
> Should I post patches 1-10 to net-next separately?
> They are pure net material that gathers the memcg code under CONFIG_MEMCG.

Yes please.


* Re: [PATCH v3 net-next 01/12] mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n.
  2025-08-14 12:30   ` Michal Koutný
@ 2025-08-14 19:17     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-14 19:17 UTC (permalink / raw)
  To: mkoutny
  Cc: akpm, almasrymina, cgroups, davem, edumazet, geliang, hannes,
	horms, kuba, kuni1840, kuniyu, linux-mm, martineau, matttbe,
	mhocko, mptcp, muchun.song, ncardwell, netdev, pabeni,
	roman.gushchin, shakeel.butt, tj, willemb

From: "Michal Koutný" <mkoutny@suse.com>
Date: Thu, 14 Aug 2025 14:30:05 +0200
> Hello.
> 
> On Tue, Aug 12, 2025 at 05:58:19PM +0000, Kuniyuki Iwashima <kuniyu@google.com> wrote:
> > When sk_alloc() allocates a socket, mem_cgroup_sk_alloc() sets
> > sk->sk_memcg based on the current task.
> > 
> > MPTCP subflow socket creation is triggered from userspace or
> > an in-kernel worker.
> 
> I somewhat remembered
> d752a4986532c ("net: memcg: late association of sock to memcg"),
> 
> but IIUC, in this MPTCP codepath the socket would never be visible to
> userspace nor manipulated from a proper process context, so there is no
> option to defer the association in a similar fashion, correct?
> 
> Then, I wonder whether this isn't a scenario for
> 	o = set_active_memcg(sk->sk_memcg);
> 	newsk =  sk_alloc();
> 	...
> 	set_active_memcg(o);
> 
> i.e. utilize the existing remote charging infra instead of introducing
> a specific mem_cgroup_sk_inherit() helper.

Sounds good to me.  sock_create_kern() is much larger than
other set_active_memcg() users, most of which just wrap simple
alloc() functions, but probably that's okay.

I'll use this in the next version.

Thanks!

---8<---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8dd7fbed5a94..450862e7fd7a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5006,8 +5006,11 @@ void mem_cgroup_sk_alloc(struct sock *sk)
 	if (!in_task())
 		return;
 
+	memcg = current->active_memcg;
+
 	rcu_read_lock();
-	memcg = mem_cgroup_from_task(current);
+	if (likely(!memcg))
+		memcg = mem_cgroup_from_task(current);
 	if (mem_cgroup_is_root(memcg))
 		goto out;
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !memcg1_tcpmem_active(memcg))
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 3f1b62a9fe88..a4809054ea6c 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1717,14 +1717,6 @@ static void mptcp_attach_cgroup(struct sock *parent, struct sock *child)
 	/* only the additional subflows created by kworkers have to be modified */
 	if (cgroup_id(sock_cgroup_ptr(parent_skcd)) !=
 	    cgroup_id(sock_cgroup_ptr(child_skcd))) {
-#ifdef CONFIG_MEMCG
-		struct mem_cgroup *memcg = parent->sk_memcg;
-
-		mem_cgroup_sk_free(child);
-		if (memcg && css_tryget(&memcg->css))
-			child->sk_memcg = memcg;
-#endif /* CONFIG_MEMCG */
-
 		cgroup_sk_free(child_skcd);
 		*child_skcd = *parent_skcd;
 		cgroup_sk_clone(child_skcd);
@@ -1757,6 +1749,7 @@ int mptcp_subflow_create_socket(struct sock *sk, unsigned short family,
 {
 	struct mptcp_subflow_context *subflow;
 	struct net *net = sock_net(sk);
+	struct mem_cgroup *memcg;
 	struct socket *sf;
 	int err;
 
@@ -1766,7 +1759,9 @@ int mptcp_subflow_create_socket(struct sock *sk, unsigned short family,
 	if (unlikely(!sk->sk_socket))
 		return -EINVAL;
 
+	memcg = set_active_memcg(sk->sk_memcg);
 	err = sock_create_kern(net, family, SOCK_STREAM, IPPROTO_TCP, &sf);
+	set_active_memcg(memcg);
 	if (err)
 		return err;
 
---8<---

