netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v10 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated.
@ 2025-09-20  0:07 Kuniyuki Iwashima
  2025-09-20  0:07 ` [PATCH v10 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept() Kuniyuki Iwashima
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-20  0:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

Some protocols (e.g., TCP, UDP) have their own memory accounting for
socket buffers and charge memory to global per-protocol counters such
as /proc/net/ipv4/tcp_mem.

If the socket has sk->sk_memcg, this memory is also charged to the memcg
as sock in memory.stat.

We do not need to pay costs for two orthogonal memory accounting
mechanisms.

This series allows decoupling memcg from the global memory accounting
(memcg + tcp_mem -> memcg) if socket is configured as such by sysctl
or BPF prog.


Overview of the series:

  patch 1 & 2 prepares for decoupling memcg from sk_prot->memory_allocated
    based on the SK_MEMCG_EXCLUSIVE flag
  patch 3 introduces net.core.memcg_exclusive
  patch 4 & 5 supports flagging SK_MEMCG_EXCLUSIVE via bpf_setsockopt()
  patch 6 is selftest


Changes:
  v10:
    * Patch 6:
      * Clean up iterations in ->create_sockets() (Martin)
      * Move ASSERT inside if to avoid noisy log in ->create_sockets()
      * Make run_test() static (Martin)
      * Remove fexit progs & possible infinite loop in get_memory_allocated() (Martin)
      * Use test__start_subtest() with if (Martin)
      * Drain UDP recv queue before close() (Martin)
      * Reduce NR_PAGES to 32 to make test faster

  v9: https://lore.kernel.org/netdev/20250917191417.1056739-1-kuniyu@google.com/
    * Patch 1: Drop sk_is_mptcp() as it's always true for MPTCP subflow
               and unnecessary for SCTP (kernel-test-robot)

  v8: https://lore.kernel.org/netdev/20250910192057.1045711-1-kuniyu@google.com/
    * Patch 3: Fix build failure when CONFIG_NET=n (kernel-test-robot)

  v7: https://lore.kernel.org/netdev/20250909204632.3994767-1-kuniyu@google.com/
    * Rename s/ISOLATED/EXCLUSIVE/ (Shakeel)
    * Add patch 3 (net.core.memcg_exclusive sysctl) (Shakeel)
    * Reorder the core patch 2 before sysctl + bpf changes
    * Patch 6
      * Add test for sysctl

  v6: https://lore.kernel.org/netdev/20250908223750.3375376-1-kuniyu@google.com/
    * Patch 4
      * Update commit message (Shakeel)
    * Patch 5
      * Trace sk_prot->memory_allocated + sk_prot->memory_per_cpu_fw_alloc (Martin)

  v5: https://lore.kernel.org/netdev/20250903190238.2511885-1-kuniyu@google.com/
    * Patch 2
      * Rename new variants to bpf_sock_create_{get,set}sockopt() (Martin)
    * Patch 3
      * Limit getsockopt() to BPF_CGROUP_INET_SOCK_CREATE (Martin)
    * Patch 5
      * Use kern_sync_rcu() (Martin)
      * Double NR_SEND to 128

  v4: https://lore.kernel.org/netdev/20250829010026.347440-1-kuniyu@google.com/
    * Patch 2
      * Use __bpf_setsockopt() instead of _bpf_setsockopt() (Martin)
      * Add getsockopt() for a cgroup with multiple bpf progs running (Martin)
    * Patch 3
      * Only allow inet_create() to set flags (Martin)
      * Inherit flags from listener to child in sk_clone_lock() (Martin)
      * Support clearing flags (Martin)
    * Patch 5
      * Only use inet_create() hook
      * Test bpf_getsockopt()
      * Add serial_ prefix
      * Reduce sleep() and the amount of sent data

  v3: https://lore.kernel.org/netdev/20250826183940.3310118-1-kuniyu@google.com/
    * Drop patches for accept() hook
    * Patch 1
      * Merge if blocks
    * Patch2
      * Drop bpf_func_proto for accept()
    * Patch 3
      * Allow flagging without sk->sk_memcg
      * Inherit SK_BPF_MEMCG_SOCK_ISOLATED in __inet_accept()

  v2: https://lore.kernel.org/bpf/20250825204158.2414402-1-kuniyu@google.com/
    * Patch 2
      * Define BPF_CGROUP_RUN_PROG_INET_SOCK_ACCEPT() when CONFIG_CGROUP_BPF=n
    * Patch 5
      * Make 2 new bpf_func_proto static
    * Patch 6
      * s/mem_cgroup_sk_set_flag/mem_cgroup_sk_set_flags/ when CONFIG_MEMCG=n
      * Use finer CONFIG_CGROUP_BPF instead of CONFIG_BPF_SYSCALL for ifdef

  v1: https://lore.kernel.org/netdev/20250822221846.744252-1-kuniyu@google.com/


Kuniyuki Iwashima (6):
  tcp: Save lock_sock() for memcg in inet_csk_accept().
  net-memcg: Allow decoupling memcg from global protocol memory
    accounting.
  net-memcg: Introduce net.core.memcg_exclusive sysctl.
  bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE.
  bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_EXCLUSIVE.
  selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE.

 Documentation/admin-guide/sysctl/net.rst      |   9 +
 include/net/netns/core.h                      |   3 +
 include/net/proto_memory.h                    |  15 +-
 include/net/sock.h                            |  47 ++-
 include/net/tcp.h                             |  10 +-
 include/uapi/linux/bpf.h                      |   6 +
 mm/memcontrol.c                               |  15 +-
 net/core/filter.c                             |  82 +++++
 net/core/sock.c                               |  65 ++--
 net/core/sysctl_net_core.c                    |  11 +
 net/ipv4/af_inet.c                            |  36 +++
 net/ipv4/inet_connection_sock.c               |  26 +-
 net/ipv4/tcp.c                                |   3 +-
 net/ipv4/tcp_output.c                         |  10 +-
 net/mptcp/protocol.c                          |   3 +-
 net/tls/tls_device.c                          |   4 +-
 tools/include/uapi/linux/bpf.h                |   6 +
 .../selftests/bpf/prog_tests/sk_memcg.c       | 282 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/sk_memcg.c  | 105 +++++++
 19 files changed, 680 insertions(+), 58 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_memcg.c
 create mode 100644 tools/testing/selftests/bpf/progs/sk_memcg.c

-- 
2.51.0.470.ga7dc726c21-goog


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v10 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept().
  2025-09-20  0:07 [PATCH v10 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
@ 2025-09-20  0:07 ` Kuniyuki Iwashima
  2025-09-20  0:07 ` [PATCH v10 bpf-next/net 2/6] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-20  0:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

If memcg is enabled, accept() acquires lock_sock() twice for each new
TCP/MPTCP socket in inet_csk_accept() and __inet_accept().

Let's move memcg operations from inet_csk_accept() to __inet_accept().

Note that SCTP somehow allocates a new socket by sk_alloc() in
sk->sk_prot->accept() and clones fields manually, instead of using
sk_clone_lock().

mem_cgroup_sk_alloc() is called for SCTP before __inet_accept(),
so I added the protocol check in __inet_accept(), but this can be
removed once SCTP uses sk_clone_lock().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
---
v9: Drop sk_is_mptcp() check as sk_is_tcp() is true for MPTCP subflow
v3: Don't split if blocks
---
 net/ipv4/af_inet.c              | 22 ++++++++++++++++++++++
 net/ipv4/inet_connection_sock.c | 25 -------------------------
 2 files changed, 22 insertions(+), 25 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 76e38092cd8a..c99724b3db04 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -753,6 +753,28 @@ EXPORT_SYMBOL(inet_stream_connect);
 
 void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *newsk)
 {
+	/* TODO: use sk_clone_lock() in SCTP and remove protocol checks */
+	if (mem_cgroup_sockets_enabled &&
+	    (!IS_ENABLED(CONFIG_IP_SCTP) || sk_is_tcp(newsk))) {
+		gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
+
+		mem_cgroup_sk_alloc(newsk);
+
+		if (mem_cgroup_from_sk(newsk)) {
+			int amt;
+
+			/* The socket has not been accepted yet, no need
+			 * to look at newsk->sk_wmem_queued.
+			 */
+			amt = sk_mem_pages(newsk->sk_forward_alloc +
+					   atomic_read(&newsk->sk_rmem_alloc));
+			if (amt)
+				mem_cgroup_sk_charge(newsk, amt, gfp);
+		}
+
+		kmem_cache_charge(newsk, gfp);
+	}
+
 	sock_rps_record_flow(newsk);
 	WARN_ON(!((1 << newsk->sk_state) &
 		  (TCPF_ESTABLISHED | TCPF_SYN_RECV |
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 0ef1eacd539d..ed10b959a906 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -708,31 +708,6 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
 
 	release_sock(sk);
 
-	if (mem_cgroup_sockets_enabled) {
-		gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
-		int amt = 0;
-
-		/* atomically get the memory usage, set and charge the
-		 * newsk->sk_memcg.
-		 */
-		lock_sock(newsk);
-
-		mem_cgroup_sk_alloc(newsk);
-		if (mem_cgroup_from_sk(newsk)) {
-			/* The socket has not been accepted yet, no need
-			 * to look at newsk->sk_wmem_queued.
-			 */
-			amt = sk_mem_pages(newsk->sk_forward_alloc +
-					   atomic_read(&newsk->sk_rmem_alloc));
-		}
-
-		if (amt)
-			mem_cgroup_sk_charge(newsk, amt, gfp);
-		kmem_cache_charge(newsk, gfp);
-
-		release_sock(newsk);
-	}
-
 	if (req)
 		reqsk_put(req);
 
-- 
2.51.0.470.ga7dc726c21-goog


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v10 bpf-next/net 2/6] net-memcg: Allow decoupling memcg from global protocol memory accounting.
  2025-09-20  0:07 [PATCH v10 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
  2025-09-20  0:07 ` [PATCH v10 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept() Kuniyuki Iwashima
@ 2025-09-20  0:07 ` Kuniyuki Iwashima
  2025-09-20  5:25   ` Shakeel Butt
  2025-09-20  0:07 ` [PATCH v10 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl Kuniyuki Iwashima
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-20  0:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

Some protocols (e.g., TCP, UDP) implement memory accounting for socket
buffers and charge memory to per-protocol global counters pointed to by
sk->sk_proto->memory_allocated.

If a socket has sk->sk_memcg, this memory is also charged to memcg as
"sock" in memory.stat.

We do not need to pay costs for two orthogonal memory accounting
mechanisms.  A microbenchmark result is in the subsequent bpf patch.

Let's decouple sockets under memcg from the global per-protocol memory
accounting if mem_cgroup_sk_exclusive() returns true.

Note that this does NOT disable memcg, but rather the per-protocol one.

mem_cgroup_sk_exclusive() starts to return true in the following patches,
and then, the per-protocol memory accounting will be skipped.

In __inet_accept(), we need to reclaim counts that are already charged
for child sockets because we do not allocate sk->sk_memcg until accept().

trace_sock_exceed_buf_limit() will always show 0 as accounted for the
memcg-exclusive sockets, but this can be obtained in memory.stat.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Nacked-by: Johannes Weiner <hannes@cmpxchg.org>
---
v7: Reorder before sysctl & bpf patches
v6: Update commit message
---
 include/net/proto_memory.h      | 15 ++++++--
 include/net/sock.h              | 10 ++++++
 include/net/tcp.h               | 10 ++++--
 net/core/sock.c                 | 64 ++++++++++++++++++++++-----------
 net/ipv4/af_inet.c              | 12 ++++++-
 net/ipv4/inet_connection_sock.c |  1 +
 net/ipv4/tcp.c                  |  3 +-
 net/ipv4/tcp_output.c           | 10 ++++--
 net/mptcp/protocol.c            |  3 +-
 net/tls/tls_device.c            |  4 ++-
 10 files changed, 100 insertions(+), 32 deletions(-)

diff --git a/include/net/proto_memory.h b/include/net/proto_memory.h
index 72d4ec413ab5..4383cb4cb2d2 100644
--- a/include/net/proto_memory.h
+++ b/include/net/proto_memory.h
@@ -31,13 +31,22 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
 	if (!sk->sk_prot->memory_pressure)
 		return false;
 
-	if (mem_cgroup_sk_enabled(sk) &&
-	    mem_cgroup_sk_under_memory_pressure(sk))
-		return true;
+	if (mem_cgroup_sk_enabled(sk)) {
+		if (mem_cgroup_sk_under_memory_pressure(sk))
+			return true;
+
+		if (mem_cgroup_sk_exclusive(sk))
+			return false;
+	}
 
 	return !!READ_ONCE(*sk->sk_prot->memory_pressure);
 }
 
+static inline bool sk_should_enter_memory_pressure(struct sock *sk)
+{
+	return !mem_cgroup_sk_enabled(sk) || !mem_cgroup_sk_exclusive(sk);
+}
+
 static inline long
 proto_memory_allocated(const struct proto *prot)
 {
diff --git a/include/net/sock.h b/include/net/sock.h
index 63a6a48afb48..66501ab670eb 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2607,6 +2607,11 @@ static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
 	return mem_cgroup_sockets_enabled && mem_cgroup_from_sk(sk);
 }
 
+static inline bool mem_cgroup_sk_exclusive(const struct sock *sk)
+{
+	return false;
+}
+
 static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_sk(sk);
@@ -2634,6 +2639,11 @@ static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
 	return false;
 }
 
+static inline bool mem_cgroup_sk_exclusive(const struct sock *sk)
+{
+	return false;
+}
+
 static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
 {
 	return false;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2936b8175950..225f6bac06c3 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -275,9 +275,13 @@ extern unsigned long tcp_memory_pressure;
 /* optimized version of sk_under_memory_pressure() for TCP sockets */
 static inline bool tcp_under_memory_pressure(const struct sock *sk)
 {
-	if (mem_cgroup_sk_enabled(sk) &&
-	    mem_cgroup_sk_under_memory_pressure(sk))
-		return true;
+	if (mem_cgroup_sk_enabled(sk)) {
+		if (mem_cgroup_sk_under_memory_pressure(sk))
+			return true;
+
+		if (mem_cgroup_sk_exclusive(sk))
+			return false;
+	}
 
 	return READ_ONCE(tcp_memory_pressure);
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index 8002ac6293dc..814966309b0e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1046,17 +1046,21 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
 	if (!charged)
 		return -ENOMEM;
 
-	/* pre-charge to forward_alloc */
-	sk_memory_allocated_add(sk, pages);
-	allocated = sk_memory_allocated(sk);
-	/* If the system goes into memory pressure with this
-	 * precharge, give up and return error.
-	 */
-	if (allocated > sk_prot_mem_limits(sk, 1)) {
-		sk_memory_allocated_sub(sk, pages);
-		mem_cgroup_sk_uncharge(sk, pages);
-		return -ENOMEM;
+	if (!mem_cgroup_sk_exclusive(sk)) {
+		/* pre-charge to forward_alloc */
+		sk_memory_allocated_add(sk, pages);
+		allocated = sk_memory_allocated(sk);
+
+		/* If the system goes into memory pressure with this
+		 * precharge, give up and return error.
+		 */
+		if (allocated > sk_prot_mem_limits(sk, 1)) {
+			sk_memory_allocated_sub(sk, pages);
+			mem_cgroup_sk_uncharge(sk, pages);
+			return -ENOMEM;
+		}
 	}
+
 	sk_forward_alloc_add(sk, pages << PAGE_SHIFT);
 
 	WRITE_ONCE(sk->sk_reserved_mem,
@@ -3153,8 +3157,11 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 	if (likely(skb_page_frag_refill(32U, pfrag, sk->sk_allocation)))
 		return true;
 
-	sk_enter_memory_pressure(sk);
+	if (sk_should_enter_memory_pressure(sk))
+		sk_enter_memory_pressure(sk);
+
 	sk_stream_moderate_sndbuf(sk);
+
 	return false;
 }
 EXPORT_SYMBOL(sk_page_frag_refill);
@@ -3267,18 +3274,30 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
 	bool memcg_enabled = false, charged = false;
 	struct proto *prot = sk->sk_prot;
-	long allocated;
-
-	sk_memory_allocated_add(sk, amt);
-	allocated = sk_memory_allocated(sk);
+	long allocated = 0;
 
 	if (mem_cgroup_sk_enabled(sk)) {
+		bool exclusive = mem_cgroup_sk_exclusive(sk);
+
 		memcg_enabled = true;
 		charged = mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge());
-		if (!charged)
+
+		if (exclusive && charged)
+			return 1;
+
+		if (!charged) {
+			if (!exclusive) {
+				sk_memory_allocated_add(sk, amt);
+				allocated = sk_memory_allocated(sk);
+			}
+
 			goto suppress_allocation;
+		}
 	}
 
+	sk_memory_allocated_add(sk, amt);
+	allocated = sk_memory_allocated(sk);
+
 	/* Under limit. */
 	if (allocated <= sk_prot_mem_limits(sk, 0)) {
 		sk_leave_memory_pressure(sk);
@@ -3357,7 +3376,8 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 
 	trace_sock_exceed_buf_limit(sk, prot, allocated, kind);
 
-	sk_memory_allocated_sub(sk, amt);
+	if (allocated)
+		sk_memory_allocated_sub(sk, amt);
 
 	if (charged)
 		mem_cgroup_sk_uncharge(sk, amt);
@@ -3396,11 +3416,15 @@ EXPORT_SYMBOL(__sk_mem_schedule);
  */
 void __sk_mem_reduce_allocated(struct sock *sk, int amount)
 {
-	sk_memory_allocated_sub(sk, amount);
-
-	if (mem_cgroup_sk_enabled(sk))
+	if (mem_cgroup_sk_enabled(sk)) {
 		mem_cgroup_sk_uncharge(sk, amount);
 
+		if (mem_cgroup_sk_exclusive(sk))
+			return;
+	}
+
+	sk_memory_allocated_sub(sk, amount);
+
 	if (sk_under_global_memory_pressure(sk) &&
 	    (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
 		sk_leave_memory_pressure(sk);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index c99724b3db04..c410eb525ebe 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -95,6 +95,7 @@
 #include <net/checksum.h>
 #include <net/ip.h>
 #include <net/protocol.h>
+#include <net/proto_memory.h>
 #include <net/arp.h>
 #include <net/route.h>
 #include <net/ip_fib.h>
@@ -768,8 +769,17 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
 			 */
 			amt = sk_mem_pages(newsk->sk_forward_alloc +
 					   atomic_read(&newsk->sk_rmem_alloc));
-			if (amt)
+			if (amt) {
+				/* This amt is already charged globally to
+				 * sk_prot->memory_allocated due to lack of
+				 * sk_memcg until accept(), thus we need to
+				 * reclaim it here if newsk is isolated.
+				 */
+				if (mem_cgroup_sk_exclusive(newsk))
+					sk_memory_allocated_sub(newsk, amt);
+
 				mem_cgroup_sk_charge(newsk, amt, gfp);
+			}
 		}
 
 		kmem_cache_charge(newsk, gfp);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index ed10b959a906..f8dd53d40dcf 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -22,6 +22,7 @@
 #include <net/tcp.h>
 #include <net/sock_reuseport.h>
 #include <net/addrconf.h>
+#include <net/proto_memory.h>
 
 #if IS_ENABLED(CONFIG_IPV6)
 /* match_sk*_wildcard == true:  IPV6_ADDR_ANY equals to any IPv6 addresses
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 71a956fbfc55..dcbd49e2f8af 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -908,7 +908,8 @@ struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
 		}
 		__kfree_skb(skb);
 	} else {
-		sk->sk_prot->enter_memory_pressure(sk);
+		if (sk_should_enter_memory_pressure(sk))
+			tcp_enter_memory_pressure(sk);
 		sk_stream_moderate_sndbuf(sk);
 	}
 	return NULL;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index dfbac0876d96..4b6a7250a9c2 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3574,12 +3574,18 @@ void sk_forced_mem_schedule(struct sock *sk, int size)
 	delta = size - sk->sk_forward_alloc;
 	if (delta <= 0)
 		return;
+
 	amt = sk_mem_pages(delta);
 	sk_forward_alloc_add(sk, amt << PAGE_SHIFT);
-	sk_memory_allocated_add(sk, amt);
 
-	if (mem_cgroup_sk_enabled(sk))
+	if (mem_cgroup_sk_enabled(sk)) {
 		mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge() | __GFP_NOFAIL);
+
+		if (mem_cgroup_sk_exclusive(sk))
+			return;
+	}
+
+	sk_memory_allocated_add(sk, amt);
 }
 
 /* Send a FIN. The caller locks the socket for us.
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 9a287b75c1b3..f7487e22a3f8 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -16,6 +16,7 @@
 #include <net/inet_common.h>
 #include <net/inet_hashtables.h>
 #include <net/protocol.h>
+#include <net/proto_memory.h>
 #include <net/tcp_states.h>
 #if IS_ENABLED(CONFIG_MPTCP_IPV6)
 #include <net/transp_v6.h>
@@ -1016,7 +1017,7 @@ static void mptcp_enter_memory_pressure(struct sock *sk)
 	mptcp_for_each_subflow(msk, subflow) {
 		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
 
-		if (first)
+		if (first && sk_should_enter_memory_pressure(ssk))
 			tcp_enter_memory_pressure(ssk);
 		sk_stream_moderate_sndbuf(ssk);
 
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index f672a62a9a52..6696ef837116 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -35,6 +35,7 @@
 #include <linux/netdevice.h>
 #include <net/dst.h>
 #include <net/inet_connection_sock.h>
+#include <net/proto_memory.h>
 #include <net/tcp.h>
 #include <net/tls.h>
 #include <linux/skbuff_ref.h>
@@ -371,7 +372,8 @@ static int tls_do_allocation(struct sock *sk,
 	if (!offload_ctx->open_record) {
 		if (unlikely(!skb_page_frag_refill(prepend_size, pfrag,
 						   sk->sk_allocation))) {
-			READ_ONCE(sk->sk_prot)->enter_memory_pressure(sk);
+			if (sk_should_enter_memory_pressure(sk))
+				READ_ONCE(sk->sk_prot)->enter_memory_pressure(sk);
 			sk_stream_moderate_sndbuf(sk);
 			return -ENOMEM;
 		}
-- 
2.51.0.470.ga7dc726c21-goog


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v10 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl.
  2025-09-20  0:07 [PATCH v10 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
  2025-09-20  0:07 ` [PATCH v10 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept() Kuniyuki Iwashima
  2025-09-20  0:07 ` [PATCH v10 bpf-next/net 2/6] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
@ 2025-09-20  0:07 ` Kuniyuki Iwashima
  2025-09-20  5:35   ` Shakeel Butt
                     ` (2 more replies)
  2025-09-20  0:07 ` [PATCH v10 bpf-next/net 4/6] bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE Kuniyuki Iwashima
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 16+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-20  0:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

If net.core.memcg_exclusive is 1 when sk->sk_memcg is allocated,
the socket is flagged with SK_MEMCG_EXCLUSIVE internally and skips
the global per-protocol memory accounting.

OTOH, for accept()ed child sockets, this flag is inherited from
the listening socket in sk_clone_lock() and set in __inet_accept().
This is to preserve the decision by BPF which will be supported later.

Given sk->sk_memcg can be accessed in the fast path, it would
be preferable to place the flag field in the same cache line as
sk->sk_memcg.

However, struct sock does not have such a 1-byte hole.

Let's store the flag in the lowest bit of sk->sk_memcg and check
it in mem_cgroup_sk_exclusive().

Tested with a script that creates local socket pairs and send()s a
bunch of data without recv()ing.

Setup:

  # mkdir /sys/fs/cgroup/test
  # echo $$ >> /sys/fs/cgroup/test/cgroup.procs
  # sysctl -q net.ipv4.tcp_mem="1000 1000 1000"

Without net.core.memcg_exclusive, charged to memcg & tcp_mem:

  # prlimit -n=524288:524288 bash -c "python3 pressure.py" &
  # cat /sys/fs/cgroup/test/memory.stat | grep sock
  sock 22642688 <-------------------------------------- charged to memcg
  # cat /proc/net/sockstat| grep TCP
  TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 5376 <-- charged to tcp_mem
  # ss -tn | head -n 5
  State Recv-Q Send-Q Local Address:Port  Peer Address:Port
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53188
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:49972
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53868
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53554
  # nstat | grep Pressure || echo no pressure
  TcpExtTCPMemoryPressures        1                  0.0

With net.core.memcg_exclusive=1, only charged to memcg:

  # sysctl -q net.core.memcg_exclusive=1
  # prlimit -n=524288:524288 bash -c "python3 pressure.py" &
  # cat /sys/fs/cgroup/test/memory.stat | grep sock
  sock 2757468160 <------------------------------------ charged to memcg
  # cat /proc/net/sockstat | grep TCP
  TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 0 <- NOT charged to tcp_mem
  # ss -tn | head -n 5
  State Recv-Q Send-Q  Local Address:Port  Peer Address:Port
  ESTAB 111000 0           127.0.0.1:36019    127.0.0.1:49026
  ESTAB 110000 0           127.0.0.1:36019    127.0.0.1:45630
  ESTAB 110000 0           127.0.0.1:36019    127.0.0.1:44870
  ESTAB 111000 0           127.0.0.1:36019    127.0.0.1:45274
  # nstat | grep Pressure || echo no pressure
  no pressure

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v8: Fix build failure when CONFIG_NET=n
---
 Documentation/admin-guide/sysctl/net.rst |  9 ++++++
 include/net/netns/core.h                 |  3 ++
 include/net/sock.h                       | 39 ++++++++++++++++++++++--
 mm/memcontrol.c                          | 12 +++++++-
 net/core/sock.c                          |  1 +
 net/core/sysctl_net_core.c               | 11 +++++++
 net/ipv4/af_inet.c                       |  4 +++
 7 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 2ef50828aff1..7272194dcf45 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -212,6 +212,15 @@ mem_pcpu_rsv
 
 Per-cpu reserved forward alloc cache size in page units. Default 1MB per CPU.
 
+memcg_exclusive
+---------------
+
+Skip charging socket buffers to the per-protocol global memory accounting
+(controlled by net.ipv4.tcp_mem, etc) if they are already charged to the
+cgroup memory controller ("sock" in memory.stat file).
+
+Default: 0
+
 rmem_default
 ------------
 
diff --git a/include/net/netns/core.h b/include/net/netns/core.h
index 9b36f0ff0c20..ec511088e67d 100644
--- a/include/net/netns/core.h
+++ b/include/net/netns/core.h
@@ -16,6 +16,9 @@ struct netns_core {
 	int	sysctl_optmem_max;
 	u8	sysctl_txrehash;
 	u8	sysctl_tstamp_allow_data;
+#ifdef CONFIG_MEMCG
+	u8	sysctl_memcg_exclusive;
+#endif
 
 #ifdef CONFIG_PROC_FS
 	struct prot_inuse __percpu *prot_inuse;
diff --git a/include/net/sock.h b/include/net/sock.h
index 66501ab670eb..0f597f4deaa3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2596,10 +2596,36 @@ static inline gfp_t gfp_memcg_charge(void)
 	return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
 }
 
+enum {
+	SK_MEMCG_EXCLUSIVE	= (1UL << 0),
+	SK_MEMCG_FLAG_MAX	= (1UL << 1),
+};
+
+#define SK_MEMCG_FLAG_MASK	(SK_MEMCG_FLAG_MAX - 1)
+#define SK_MEMCG_PTR_MASK	~SK_MEMCG_FLAG_MASK
+
 #ifdef CONFIG_MEMCG
 static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
 {
-	return sk->sk_memcg;
+	unsigned long val = (unsigned long)sk->sk_memcg;
+
+	val &= SK_MEMCG_PTR_MASK;
+	return (struct mem_cgroup *)val;
+}
+
+static inline void mem_cgroup_sk_set_flags(struct sock *sk, unsigned short flags)
+{
+	unsigned long val = (unsigned long)mem_cgroup_from_sk(sk);
+
+	val |= flags;
+	sk->sk_memcg = (struct mem_cgroup *)val;
+}
+
+static inline unsigned short mem_cgroup_sk_get_flags(const struct sock *sk)
+{
+	unsigned long val = (unsigned long)sk->sk_memcg;
+
+	return val & SK_MEMCG_FLAG_MASK;
 }
 
 static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
@@ -2609,7 +2635,7 @@ static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
 
 static inline bool mem_cgroup_sk_exclusive(const struct sock *sk)
 {
-	return false;
+	return mem_cgroup_sk_get_flags(sk) & SK_MEMCG_EXCLUSIVE;
 }
 
 static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
@@ -2634,6 +2660,15 @@ static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
 	return NULL;
 }
 
+static inline void mem_cgroup_sk_set_flags(struct sock *sk, unsigned short flags)
+{
+}
+
+static inline unsigned short mem_cgroup_sk_get_flags(const struct sock *sk)
+{
+	return 0;
+}
+
 static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
 {
 	return false;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index df3e9205c9e6..88028af8ac28 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4995,6 +4995,16 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
 DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
 EXPORT_SYMBOL(memcg_sockets_enabled_key);
 
+static void mem_cgroup_sk_set(struct sock *sk, struct mem_cgroup *memcg)
+{
+	sk->sk_memcg = memcg;
+
+#ifdef CONFIG_NET
+	if (READ_ONCE(sock_net(sk)->core.sysctl_memcg_exclusive))
+		mem_cgroup_sk_set_flags(sk, SK_MEMCG_EXCLUSIVE);
+#endif
+}
+
 void mem_cgroup_sk_alloc(struct sock *sk)
 {
 	struct mem_cgroup *memcg;
@@ -5013,7 +5023,7 @@ void mem_cgroup_sk_alloc(struct sock *sk)
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !memcg1_tcpmem_active(memcg))
 		goto out;
 	if (css_tryget(&memcg->css))
-		sk->sk_memcg = memcg;
+		mem_cgroup_sk_set(sk, memcg);
 out:
 	rcu_read_unlock();
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index 814966309b0e..348e599c3fbc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2519,6 +2519,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 #ifdef CONFIG_MEMCG
 	/* sk->sk_memcg will be populated at accept() time */
 	newsk->sk_memcg = NULL;
+	mem_cgroup_sk_set_flags(newsk, mem_cgroup_sk_get_flags(sk));
 #endif
 
 	cgroup_sk_clone(&newsk->sk_cgrp_data);
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 8cf04b57ade1..c8b5fc3b8435 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -676,6 +676,17 @@ static struct ctl_table netns_core_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE
 	},
+#ifdef CONFIG_MEMCG
+	{
+		.procname	= "memcg_exclusive",
+		.data		= &init_net.core.sysctl_memcg_exclusive,
+		.maxlen		= sizeof(u8),
+		.mode		= 0644,
+		.proc_handler	= proc_dou8vec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE
+	},
+#endif
 	/* sysctl_core_net_init() will set the values after this
 	 * to readonly in network namespaces
 	 */
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index c410eb525ebe..be27de94143c 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -758,12 +758,16 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
 	if (mem_cgroup_sockets_enabled &&
 	    (!IS_ENABLED(CONFIG_IP_SCTP) || sk_is_tcp(newsk))) {
 		gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
+		unsigned short flags;
 
+		flags = mem_cgroup_sk_get_flags(newsk);
 		mem_cgroup_sk_alloc(newsk);
 
 		if (mem_cgroup_from_sk(newsk)) {
 			int amt;
 
+			mem_cgroup_sk_set_flags(newsk, flags);
+
 			/* The socket has not been accepted yet, no need
 			 * to look at newsk->sk_wmem_queued.
 			 */
-- 
2.51.0.470.ga7dc726c21-goog


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v10 bpf-next/net 4/6] bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE.
  2025-09-20  0:07 [PATCH v10 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
                   ` (2 preceding siblings ...)
  2025-09-20  0:07 ` [PATCH v10 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl Kuniyuki Iwashima
@ 2025-09-20  0:07 ` Kuniyuki Iwashima
  2025-09-20  0:07 ` [PATCH v10 bpf-next/net 5/6] bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_EXCLUSIVE Kuniyuki Iwashima
  2025-09-20  0:07 ` [PATCH v10 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE Kuniyuki Iwashima
  5 siblings, 0 replies; 16+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-20  0:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

We will support flagging SK_MEMCG_EXCLUSIVE via bpf_setsockopt() at
the BPF_CGROUP_INET_SOCK_CREATE hook.

BPF_CGROUP_INET_SOCK_CREATE is invoked by __cgroup_bpf_run_filter_sk()
that passes a pointer to struct sock to the bpf prog as void *ctx.

But there are no bpf_func_proto for bpf_setsockopt() that receives
the ctx as a pointer to struct sock.

Also, bpf_getsockopt() will be necessary for a cgroup with multiple
bpf progs running.

Let's add new bpf_setsockopt() and bpf_getsockopt() variants for
BPF_CGROUP_INET_SOCK_CREATE.

Note that inet_create() is not under lock_sock() and has the same
semantics with bpf_lsm_unlocked_sockopt_hooks.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v5: Rename new variants to bpf_sock_create_{get,set}sockopt().
v4:
  * Use __bpf_setsockopt() instead of _bpf_setsockopt()
  * Add getsockopt() for a cgroup with multiple bpf progs running
v3: Remove bpf_func_proto for accept()
v2: Make 2 new bpf_func_proto static
---
 net/core/filter.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 63f3baee2daf..31b259f02ee9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5723,6 +5723,40 @@ static const struct bpf_func_proto bpf_sock_addr_getsockopt_proto = {
 	.arg5_type	= ARG_CONST_SIZE,
 };
 
+BPF_CALL_5(bpf_sock_create_setsockopt, struct sock *, sk, int, level,
+	   int, optname, char *, optval, int, optlen)
+{
+	return __bpf_setsockopt(sk, level, optname, optval, optlen);
+}
+
+static const struct bpf_func_proto bpf_sock_create_setsockopt_proto = {
+	.func		= bpf_sock_create_setsockopt,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+	.arg4_type	= ARG_PTR_TO_MEM | MEM_RDONLY,
+	.arg5_type	= ARG_CONST_SIZE,
+};
+
+BPF_CALL_5(bpf_sock_create_getsockopt, struct sock *, sk, int, level,
+	   int, optname, char *, optval, int, optlen)
+{
+	return __bpf_getsockopt(sk, level, optname, optval, optlen);
+}
+
+static const struct bpf_func_proto bpf_sock_create_getsockopt_proto = {
+	.func		= bpf_sock_create_getsockopt,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+	.arg4_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg5_type	= ARG_CONST_SIZE,
+};
+
 BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 	   int, level, int, optname, char *, optval, int, optlen)
 {
@@ -8051,6 +8085,20 @@ sock_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_sk_storage_get_cg_sock_proto;
 	case BPF_FUNC_ktime_get_coarse_ns:
 		return &bpf_ktime_get_coarse_ns_proto;
+	case BPF_FUNC_setsockopt:
+		switch (prog->expected_attach_type) {
+		case BPF_CGROUP_INET_SOCK_CREATE:
+			return &bpf_sock_create_setsockopt_proto;
+		default:
+			return NULL;
+		}
+	case BPF_FUNC_getsockopt:
+		switch (prog->expected_attach_type) {
+		case BPF_CGROUP_INET_SOCK_CREATE:
+			return &bpf_sock_create_getsockopt_proto;
+		default:
+			return NULL;
+		}
 	default:
 		return bpf_base_func_proto(func_id, prog);
 	}
-- 
2.51.0.470.ga7dc726c21-goog


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v10 bpf-next/net 5/6] bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_EXCLUSIVE.
  2025-09-20  0:07 [PATCH v10 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
                   ` (3 preceding siblings ...)
  2025-09-20  0:07 ` [PATCH v10 bpf-next/net 4/6] bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE Kuniyuki Iwashima
@ 2025-09-20  0:07 ` Kuniyuki Iwashima
  2025-09-20  0:07 ` [PATCH v10 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE Kuniyuki Iwashima
  5 siblings, 0 replies; 16+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-20  0:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

If a socket has sk->sk_memcg with SK_MEMCG_EXCLUSIVE, it is decoupled
from the global protocol memory accounting.

This is controlled by net.core.memcg_exclusive sysctl, but it lacks
flexibility.

Let's support flagging (and clearing) SK_MEMCG_EXCLUSIVE via
bpf_setsockopt() at the BPF_CGROUP_INET_SOCK_CREATE hook.

  u32 flags = SK_BPF_MEMCG_EXCLUSIVE;

  bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
                 &flags, sizeof(flags));

As with net.core.memcg_exclusive, this is inherited to child sockets,
and BPF always takes precedence over sysctl at socket(2) and accept(2).

SK_BPF_MEMCG_FLAGS is only supported at BPF_CGROUP_INET_SOCK_CREATE
and not supported on other hooks for some reasons:

  1. UDP charges memory under sk->sk_receive_queue.lock instead
     of lock_sock()

  2. For TCP child sockets, memory accounting is adjusted only in
     __inet_accept() which sk->sk_memcg allocation is deferred to

  3. Modifying the flag after skb is charged to sk requires such
     adjustment during bpf_setsockopt() and complicates the logic
     unnecessarily

We can support other hooks later if a real use case justifies that.

Most changes are inline and hard to trace, but a microbenchmark on
__sk_mem_raise_allocated() during neper/tcp_stream showed that more
samples completed faster with SK_MEMCG_EXCLUSIVE.  This will be more
visible under tcp_mem pressure.

  # bpftrace -e 'kprobe:__sk_mem_raise_allocated { @start[tid] = nsecs; }
    kretprobe:__sk_mem_raise_allocated /@start[tid]/
    { @end[tid] = nsecs - @start[tid]; @times = hist(@end[tid]); delete(@start[tid]); }'
  # tcp_stream -6 -F 1000 -N -T 256

Without bpf prog:

  [128, 256)          3846 |                                                    |
  [256, 512)       1505326 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [512, 1K)        1371006 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
  [1K, 2K)          198207 |@@@@@@                                              |
  [2K, 4K)           31199 |@                                                   |

With bpf prog in the next patch:
  (must be attached before tcp_stream)
  # bpftool prog load sk_memcg.bpf.o /sys/fs/bpf/sk_memcg type cgroup/sock_create
  # bpftool cgroup attach /sys/fs/cgroup/test cgroup_inet_sock_create pinned /sys/fs/bpf/sk_memcg

  [128, 256)          6413 |                                                    |
  [256, 512)       1868425 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [512, 1K)        1101697 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
  [1K, 2K)          117031 |@@@@                                                |
  [2K, 4K)           11773 |                                                    |

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v7:
  * Update commit message.

v5:
  * Limit getsockopt() to BPF_CGROUP_INET_SOCK_CREATE

v4:
  * Only allow inet_create() to set flags
  * Inherit flags from listener to child in sk_clone_lock()
  * Support clearing flags

v3:
  * Allow setting flags without sk->sk_memcg in sk_bpf_set_get_memcg_flags()
  * Preserve flags in __inet_accept()

v2:
  * s/mem_cgroup_sk_set_flag/mem_cgroup_sk_set_flags/ when CONFIG_MEMCG=n
  * Use CONFIG_CGROUP_BPF instead of CONFIG_BPF_SYSCALL for ifdef
---
 include/uapi/linux/bpf.h       |  6 ++++++
 mm/memcontrol.c                |  3 +++
 net/core/filter.c              | 34 ++++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  6 ++++++
 4 files changed, 49 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 233de8677382..35e3ce40ac90 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7182,6 +7182,7 @@ enum {
 	TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
 	TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
 	SK_BPF_CB_FLAGS		= 1009, /* Get or set sock ops flags in socket */
+	SK_BPF_MEMCG_FLAGS	= 1010, /* Get or Set flags saved in sk->sk_memcg */
 };
 
 enum {
@@ -7204,6 +7205,11 @@ enum {
 						 */
 };
 
+enum {
+	SK_BPF_MEMCG_EXCLUSIVE	= (1UL << 0),
+	SK_BPF_MEMCG_FLAG_MAX	= (1UL << 1),
+};
+
 struct bpf_perf_event_value {
 	__u64 counter;
 	__u64 enabled;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 88028af8ac28..b7d405b57e23 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4997,6 +4997,9 @@ EXPORT_SYMBOL(memcg_sockets_enabled_key);
 
 static void mem_cgroup_sk_set(struct sock *sk, struct mem_cgroup *memcg)
 {
+	BUILD_BUG_ON((unsigned short)SK_MEMCG_EXCLUSIVE != SK_BPF_MEMCG_EXCLUSIVE);
+	BUILD_BUG_ON((unsigned short)SK_MEMCG_FLAG_MAX != SK_BPF_MEMCG_FLAG_MAX);
+
 	sk->sk_memcg = memcg;
 
 #ifdef CONFIG_NET
diff --git a/net/core/filter.c b/net/core/filter.c
index 31b259f02ee9..df2496120076 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5723,9 +5723,39 @@ static const struct bpf_func_proto bpf_sock_addr_getsockopt_proto = {
 	.arg5_type	= ARG_CONST_SIZE,
 };
 
+static int sk_bpf_set_get_memcg_flags(struct sock *sk,
+				      char *optval, int optlen,
+				      bool getopt)
+{
+	u32 flags;
+
+	if (optlen != sizeof(u32))
+		return -EINVAL;
+
+	if (!sk_has_account(sk))
+		return -EOPNOTSUPP;
+
+	if (getopt) {
+		*(u32 *)optval = mem_cgroup_sk_get_flags(sk);
+		return 0;
+	}
+
+	flags = *(u32 *)optval;
+	if (flags >= SK_BPF_MEMCG_FLAG_MAX)
+		return -EINVAL;
+
+	mem_cgroup_sk_set_flags(sk, flags);
+
+	return 0;
+}
+
 BPF_CALL_5(bpf_sock_create_setsockopt, struct sock *, sk, int, level,
 	   int, optname, char *, optval, int, optlen)
 {
+	if (IS_ENABLED(CONFIG_MEMCG) &&
+	    level == SOL_SOCKET && optname == SK_BPF_MEMCG_FLAGS)
+		return sk_bpf_set_get_memcg_flags(sk, optval, optlen, false);
+
 	return __bpf_setsockopt(sk, level, optname, optval, optlen);
 }
 
@@ -5743,6 +5773,10 @@ static const struct bpf_func_proto bpf_sock_create_setsockopt_proto = {
 BPF_CALL_5(bpf_sock_create_getsockopt, struct sock *, sk, int, level,
 	   int, optname, char *, optval, int, optlen)
 {
+	if (IS_ENABLED(CONFIG_MEMCG) &&
+	    level == SOL_SOCKET && optname == SK_BPF_MEMCG_FLAGS)
+		return sk_bpf_set_get_memcg_flags(sk, optval, optlen, true);
+
 	return __bpf_getsockopt(sk, level, optname, optval, optlen);
 }
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 233de8677382..35e3ce40ac90 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7182,6 +7182,7 @@ enum {
 	TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
 	TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
 	SK_BPF_CB_FLAGS		= 1009, /* Get or set sock ops flags in socket */
+	SK_BPF_MEMCG_FLAGS	= 1010, /* Get or Set flags saved in sk->sk_memcg */
 };
 
 enum {
@@ -7204,6 +7205,11 @@ enum {
 						 */
 };
 
+enum {
+	SK_BPF_MEMCG_EXCLUSIVE	= (1UL << 0),
+	SK_BPF_MEMCG_FLAG_MAX	= (1UL << 1),
+};
+
 struct bpf_perf_event_value {
 	__u64 counter;
 	__u64 enabled;
-- 
2.51.0.470.ga7dc726c21-goog


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v10 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE.
  2025-09-20  0:07 [PATCH v10 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
                   ` (4 preceding siblings ...)
  2025-09-20  0:07 ` [PATCH v10 bpf-next/net 5/6] bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_EXCLUSIVE Kuniyuki Iwashima
@ 2025-09-20  0:07 ` Kuniyuki Iwashima
  5 siblings, 0 replies; 16+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-20  0:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

The test does the following for IPv4/IPv6 x TCP/UDP sockets
with/without SK_MEMCG_EXCLUSIVE, which can be turned on by
net.core.memcg_exclusive or bpf_setsockopt(SK_BPF_MEMCG_EXCLUSIVE).

  1. Create socket pairs
  2. Send NR_PAGES (32) of data (TCP consumes around 35 pages,
     and UDP consuems 66 pages due to skb overhead)
  3. Read memory_allocated from sk->sk_prot->memory_allocated and
     sk->sk_prot->memory_per_cpu_fw_alloc
  4. Check if unread data is charged to memory_allocated

If SK_MEMCG_EXCLUSIVE is set, memory_allocated should not be
changed, but we allow a small error (up to 10 pages) in case
other processes on the host use some amounts of TCP/UDP memory.

The amount of allocated pages are buffered to per-cpu variable
{tcp,udp}_memory_per_cpu_fw_alloc up to +/- net.core.mem_pcpu_rsv
before reported to {tcp,udp}_memory_allocated.

At 3., memory_allocated is calculated from the 2 variables at
fentry of socket create function.

We drain the receive queue only for UDP before close() because UDP
recv queue is destroyed after RCU grace period.  When I printed
memory_allocated, UDP exclusive cases sometimes saw the non-exclusive
case's leftover, but it's still in the small error range (<10 pages).

  bpf_trace_printk: memory_allocated: 0   <-- TCP non-exclusive
  bpf_trace_printk: memory_allocated: 35
  bpf_trace_printk: memory_allocated: 0   <-- TCP w/ sysctl
  bpf_trace_printk: memory_allocated: 0
  bpf_trace_printk: memory_allocated: 0   <-- TCP w/ bpf
  bpf_trace_printk: memory_allocated: 0
  bpf_trace_printk: memory_allocated: 0   <-- UDP non-exclusive
  bpf_trace_printk: memory_allocated: 66
  bpf_trace_printk: memory_allocated: 2   <-- UDP w/ sysctl (2 pages leftover)
  bpf_trace_printk: memory_allocated: 2
  bpf_trace_printk: memory_allocated: 2   <-- UDP w/ bpf (2 pages leftover)
  bpf_trace_printk: memory_allocated: 2

We prefer finishing tests faster than oversleeping for call_rcu()
 + sk_destruct().

The test completes within 2s on QEMU (64 CPUs) w/ KVM.

  # time ./test_progs -t sk_memcg
  #370/1   sk_memcg/TCP  :OK
  #370/2   sk_memcg/UDP  :OK
  #370/3   sk_memcg/TCPv6:OK
  #370/4   sk_memcg/UDPv6:OK
  #370     sk_memcg:OK
  Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED

  real	0m1.609s
  user	0m0.167s
  sys	0m0.461s

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v10:
  * Clean up iterations in ->create_sockets()
  * Move ASSERT inside if to avoid noisy log in ->create_sockets()
  * Make run_test() static
  * Remove fexit progs & possible infinite loop in get_memory_allocated()
  * Use test__start_subtest() with if
  * Speed up test by reducing NR_PAGES to 32 and draing UDP queue
    instead of oversleeping for call_rcu()+sk_destruct()

v7:
  * Add test for sysctl

v6:
  * Trace sk_prot->memory_allocated + sk_prot->memory_per_cpu_fw_alloc

v5:
  * Use kern_sync_rcu()
  * Double NR_SEND to 128

v4:
  * Only use inet_create() hook
  * Test bpf_getsockopt()
  * Add serial_ prefix
  * Reduce sleep() and the amount of sent data
---
 .../selftests/bpf/prog_tests/sk_memcg.c       | 282 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/sk_memcg.c  | 105 +++++++
 2 files changed, 387 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_memcg.c
 create mode 100644 tools/testing/selftests/bpf/progs/sk_memcg.c

diff --git a/tools/testing/selftests/bpf/prog_tests/sk_memcg.c b/tools/testing/selftests/bpf/prog_tests/sk_memcg.c
new file mode 100644
index 000000000000..9e0aa7e7df27
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/sk_memcg.c
@@ -0,0 +1,282 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2025 Google LLC */
+
+#include <test_progs.h>
+#include "sk_memcg.skel.h"
+#include "network_helpers.h"
+
+#define NR_PAGES	32
+#define NR_SOCKETS	2
+#define BUF_TOTAL	(NR_PAGES * 4096 / NR_SOCKETS)
+#define BUF_SINGLE	1024
+#define NR_SEND		(BUF_TOTAL / BUF_SINGLE)
+
+struct test_case {
+	char name[8];
+	int family;
+	int type;
+	int (*create_sockets)(struct test_case *test_case, int sk[], int len);
+	long (*get_memory_allocated)(struct test_case *test_case, struct sk_memcg *skel);
+};
+
+static int tcp_create_sockets(struct test_case *test_case, int sk[], int len)
+{
+	int server, i;
+
+	server = start_server(test_case->family, test_case->type, NULL, 0, 0);
+	ASSERT_GE(server, 0, "start_server_str");
+
+	/* Keep for-loop so we can change NR_SOCKETS easily. */
+	for (i = 0; i < len; i += 2) {
+		sk[i] = connect_to_fd(server, 0);
+		if (sk[i] < 0) {
+			ASSERT_GE(sk[i], 0, "connect_to_fd");
+			return sk[i];
+		}
+
+		sk[i + 1] = accept(server, NULL, NULL);
+		if (sk[i + 1] < 0) {
+			ASSERT_GE(sk[i + 1], 0, "accept");
+			return sk[i + 1];
+		}
+	}
+
+	close(server);
+
+	return 0;
+}
+
+static int udp_create_sockets(struct test_case *test_case, int sk[], int len)
+{
+	int i, j, err, rcvbuf = BUF_TOTAL;
+
+	/* Keep for-loop so we can change NR_SOCKETS easily. */
+	for (i = 0; i < len; i += 2) {
+		sk[i] = start_server(test_case->family, test_case->type, NULL, 0, 0);
+		if (sk[i] < 0) {
+			ASSERT_GE(sk[i], 0, "start_server");
+			return sk[i];
+		}
+
+		sk[i + 1] = connect_to_fd(sk[i], 0);
+		if (sk[i + 1] < 0) {
+			ASSERT_GE(sk[i + 1], 0, "connect_to_fd");
+			return sk[i + 1];
+		}
+
+		err = connect_fd_to_fd(sk[i], sk[i + 1], 0);
+		if (err) {
+			ASSERT_EQ(err, 0, "connect_fd_to_fd");
+			return err;
+		}
+
+		for (j = 0; j < 2; j++) {
+			err = setsockopt(sk[i + j], SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(int));
+			if (err) {
+				ASSERT_EQ(err, 0, "setsockopt(SO_RCVBUF)");
+				return err;
+			}
+		}
+	}
+
+	return 0;
+}
+
+static long get_memory_allocated(struct test_case *test_case,
+				 bool *activated, long *memory_allocated)
+{
+	int sk;
+
+	*activated = true;
+
+	/* AF_INET and AF_INET6 share the same memory_allocated.
+	 * tcp_init_sock() is called by AF_INET and AF_INET6,
+	 * but udp_lib_init_sock() is inline.
+	 */
+	sk = socket(AF_INET, test_case->type, 0);
+	if (!ASSERT_GE(sk, 0, "get_memory_allocated"))
+		return -1;
+
+	close(sk);
+
+	return *memory_allocated;
+}
+
+static long tcp_get_memory_allocated(struct test_case *test_case, struct sk_memcg *skel)
+{
+	return get_memory_allocated(test_case,
+				    &skel->bss->tcp_activated,
+				    &skel->bss->tcp_memory_allocated);
+}
+
+static long udp_get_memory_allocated(struct test_case *test_case, struct sk_memcg *skel)
+{
+	return get_memory_allocated(test_case,
+				    &skel->bss->udp_activated,
+				    &skel->bss->udp_memory_allocated);
+}
+
+static int check_exclusive(struct test_case *test_case,
+			   struct sk_memcg *skel, bool exclusive)
+{
+	char buf[BUF_SINGLE] = {};
+	long memory_allocated[2];
+	int sk[NR_SOCKETS] = {};
+	int err, i, j;
+
+	err = test_case->create_sockets(test_case, sk, ARRAY_SIZE(sk));
+	if (err)
+		goto close;
+
+	memory_allocated[0] = test_case->get_memory_allocated(test_case, skel);
+
+	/* allocate pages >= NR_PAGES */
+	for (i = 0; i < ARRAY_SIZE(sk); i++) {
+		for (j = 0; j < NR_SEND; j++) {
+			int bytes = send(sk[i], buf, sizeof(buf), 0);
+
+			/* Avoid too noisy logs when something failed. */
+			if (bytes != sizeof(buf)) {
+				ASSERT_EQ(bytes, sizeof(buf), "send");
+				if (bytes < 0) {
+					err = bytes;
+					goto drain;
+				}
+			}
+		}
+	}
+
+	memory_allocated[1] = test_case->get_memory_allocated(test_case, skel);
+
+	if (exclusive)
+		ASSERT_LE(memory_allocated[1], memory_allocated[0] + 10, "exclusive");
+	else
+		ASSERT_GT(memory_allocated[1], memory_allocated[0] + NR_PAGES, "not exclusive");
+
+drain:
+	if (test_case->type == SOCK_DGRAM) {
+		/* UDP starts purging sk->sk_receive_queue after one RCU
+		 * grace period, then udp_memory_allocated goes down,
+		 * so drain the queue before close().
+		 */
+		for (i = 0; i < ARRAY_SIZE(sk); i++) {
+			for (j = 0; j < NR_SEND; j++) {
+				int bytes = recv(sk[i], buf, 1, MSG_DONTWAIT | MSG_TRUNC);
+
+				if (bytes == sizeof(buf))
+					continue;
+				if (bytes != -1 || errno != EAGAIN)
+					PRINT_FAIL("bytes: %d, errno: %s\n", bytes, strerror(errno));
+				break;
+			}
+		}
+	}
+
+close:
+	for (i = 0; i < ARRAY_SIZE(sk); i++)
+		close(sk[i]);
+
+	return err;
+}
+
+static void run_test(struct test_case *test_case)
+{
+	struct nstoken *nstoken;
+	struct sk_memcg *skel;
+	int cgroup, err;
+
+	skel = sk_memcg__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	skel->bss->nr_cpus = libbpf_num_possible_cpus();
+
+	err = sk_memcg__attach(skel);
+	if (!ASSERT_OK(err, "attach"))
+		goto destroy_skel;
+
+	cgroup = test__join_cgroup("/sk_memcg");
+	if (!ASSERT_GE(cgroup, 0, "join_cgroup"))
+		goto destroy_skel;
+
+	err = make_netns("sk_memcg");
+	if (!ASSERT_EQ(err, 0, "make_netns"))
+		goto close_cgroup;
+
+	nstoken = open_netns("sk_memcg");
+	if (!ASSERT_OK_PTR(nstoken, "open_netns"))
+		goto remove_netns;
+
+	err = check_exclusive(test_case, skel, false);
+	if (!ASSERT_EQ(err, 0, "test_exclusive(false)"))
+		goto close_netns;
+
+	err = write_sysctl("/proc/sys/net/core/memcg_exclusive", "1");
+	if (!ASSERT_EQ(err, 0, "write_sysctl(1)"))
+		goto close_netns;
+
+	err = check_exclusive(test_case, skel, true);
+	if (!ASSERT_EQ(err, 0, "test_exclusive(true by sysctl)"))
+		goto close_netns;
+
+	err = write_sysctl("/proc/sys/net/core/memcg_exclusive", "0");
+	if (!ASSERT_EQ(err, 0, "write_sysctl(0)"))
+		goto close_netns;
+
+	skel->links.sock_create = bpf_program__attach_cgroup(skel->progs.sock_create, cgroup);
+	if (!ASSERT_OK_PTR(skel->links.sock_create, "attach_cgroup(sock_create)"))
+		goto close_netns;
+
+	err = check_exclusive(test_case, skel, true);
+	ASSERT_EQ(err, 0, "test_exclusive(true by bpf)");
+
+close_netns:
+	close_netns(nstoken);
+remove_netns:
+	remove_netns("sk_memcg");
+close_cgroup:
+	close(cgroup);
+destroy_skel:
+	sk_memcg__destroy(skel);
+}
+
+struct test_case test_cases[] = {
+	{
+		.name = "TCP  ",
+		.family = AF_INET,
+		.type = SOCK_STREAM,
+		.create_sockets = tcp_create_sockets,
+		.get_memory_allocated = tcp_get_memory_allocated,
+	},
+	{
+		.name = "UDP  ",
+		.family = AF_INET,
+		.type = SOCK_DGRAM,
+		.create_sockets = udp_create_sockets,
+		.get_memory_allocated = udp_get_memory_allocated,
+	},
+	{
+		.name = "TCPv6",
+		.family = AF_INET6,
+		.type = SOCK_STREAM,
+		.create_sockets = tcp_create_sockets,
+		.get_memory_allocated = tcp_get_memory_allocated,
+	},
+	{
+		.name = "UDPv6",
+		.family = AF_INET6,
+		.type = SOCK_DGRAM,
+		.create_sockets = udp_create_sockets,
+		.get_memory_allocated = udp_get_memory_allocated,
+	},
+};
+
+void serial_test_sk_memcg(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
+		if (test__start_subtest(test_cases[i].name))
+			run_test(&test_cases[i]);
+	}
+}
diff --git a/tools/testing/selftests/bpf/progs/sk_memcg.c b/tools/testing/selftests/bpf/progs/sk_memcg.c
new file mode 100644
index 000000000000..8430ad920224
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/sk_memcg.c
@@ -0,0 +1,105 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2025 Google LLC */
+
+#include "bpf_tracing_net.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <errno.h>
+
+extern int tcp_memory_per_cpu_fw_alloc __ksym;
+extern int udp_memory_per_cpu_fw_alloc __ksym;
+
+int nr_cpus;
+bool tcp_activated, udp_activated;
+long tcp_memory_allocated, udp_memory_allocated;
+
+struct sk_prot {
+	long *memory_allocated;
+	int *memory_per_cpu_fw_alloc;
+};
+
+static int drain_memory_per_cpu_fw_alloc(__u32 i, struct sk_prot *sk_prot_ctx)
+{
+	int *memory_per_cpu_fw_alloc;
+
+	memory_per_cpu_fw_alloc = bpf_per_cpu_ptr(sk_prot_ctx->memory_per_cpu_fw_alloc, i);
+	if (memory_per_cpu_fw_alloc)
+		*sk_prot_ctx->memory_allocated += *memory_per_cpu_fw_alloc;
+
+	return 0;
+}
+
+static long get_memory_allocated(struct sock *_sk, int *memory_per_cpu_fw_alloc)
+{
+	struct sock *sk = bpf_core_cast(_sk, struct sock);
+	struct sk_prot sk_prot_ctx;
+	long memory_allocated;
+
+	/* net_aligned_data.{tcp,udp}_memory_allocated was not available. */
+	memory_allocated = sk->__sk_common.skc_prot->memory_allocated->counter;
+
+	sk_prot_ctx.memory_allocated = &memory_allocated;
+	sk_prot_ctx.memory_per_cpu_fw_alloc = memory_per_cpu_fw_alloc;
+
+	bpf_loop(nr_cpus, drain_memory_per_cpu_fw_alloc, &sk_prot_ctx, 0);
+
+	return memory_allocated;
+}
+
+static void fentry_init_sock(struct sock *sk, bool *activated,
+			     long *memory_allocated, int *memory_per_cpu_fw_alloc)
+{
+	if (!*activated)
+		return;
+
+	*memory_allocated = get_memory_allocated(sk, memory_per_cpu_fw_alloc);
+	*activated = false;
+}
+
+SEC("fentry/tcp_init_sock")
+int BPF_PROG(fentry_tcp_init_sock, struct sock *sk)
+{
+	fentry_init_sock(sk, &tcp_activated,
+			 &tcp_memory_allocated, &tcp_memory_per_cpu_fw_alloc);
+	return 0;
+}
+
+SEC("fentry/udp_init_sock")
+int BPF_PROG(fentry_udp_init_sock, struct sock *sk)
+{
+	fentry_init_sock(sk, &udp_activated,
+			 &udp_memory_allocated, &udp_memory_per_cpu_fw_alloc);
+	return 0;
+}
+
+SEC("cgroup/sock_create")
+int sock_create(struct bpf_sock *ctx)
+{
+	u32 flags = SK_BPF_MEMCG_EXCLUSIVE;
+	int err;
+
+	err = bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
+			     &flags, sizeof(flags));
+	if (err)
+		goto err;
+
+	flags = 0;
+
+	err = bpf_getsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
+			     &flags, sizeof(flags));
+	if (err)
+		goto err;
+
+	if (flags != SK_BPF_MEMCG_EXCLUSIVE) {
+		err = -EINVAL;
+		goto err;
+	}
+
+	return 1;
+
+err:
+	bpf_set_retval(err);
+	return 0;
+}
+
+char LICENSE[] SEC("license") = "GPL";
-- 
2.51.0.470.ga7dc726c21-goog


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v10 bpf-next/net 2/6] net-memcg: Allow decoupling memcg from global protocol memory accounting.
  2025-09-20  0:07 ` [PATCH v10 bpf-next/net 2/6] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
@ 2025-09-20  5:25   ` Shakeel Butt
  2025-09-25 22:40     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2025-09-20  5:25 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, John Fastabend, Stanislav Fomichev,
	Johannes Weiner, Michal Hocko, Roman Gushchin, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Willem de Bruijn, Mina Almasry, Kuniyuki Iwashima, bpf, netdev

On Sat, Sep 20, 2025 at 12:07:16AM +0000, Kuniyuki Iwashima wrote:
> Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> buffers and charge memory to per-protocol global counters pointed to by
> sk->sk_proto->memory_allocated.
> 
> If a socket has sk->sk_memcg, this memory is also charged to memcg as
> "sock" in memory.stat.
> 
> We do not need to pay costs for two orthogonal memory accounting
> mechanisms.  A microbenchmark result is in the subsequent bpf patch.
> 
> Let's decouple sockets under memcg from the global per-protocol memory
> accounting if mem_cgroup_sk_exclusive() returns true.
> 
> Note that this does NOT disable memcg, but rather the per-protocol one.
> 
> mem_cgroup_sk_exclusive() starts to return true in the following patches,
> and then, the per-protocol memory accounting will be skipped.
> 
> In __inet_accept(), we need to reclaim counts that are already charged
> for child sockets because we do not allocate sk->sk_memcg until accept().
> 
> trace_sock_exceed_buf_limit() will always show 0 as accounted for the
> memcg-exclusive sockets, but this can be obtained in memory.stat.
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> Nacked-by: Johannes Weiner <hannes@cmpxchg.org>

This looks good to me now, let's ask Johannes to take a look again and if
he still has any concerns.

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v10 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl.
  2025-09-20  0:07 ` [PATCH v10 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl Kuniyuki Iwashima
@ 2025-09-20  5:35   ` Shakeel Butt
  2025-09-20  6:22     ` Kuniyuki Iwashima
  2025-09-23  0:54   ` Shakeel Butt
  2025-09-25 22:52   ` Shakeel Butt
  2 siblings, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2025-09-20  5:35 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, John Fastabend, Stanislav Fomichev,
	Johannes Weiner, Michal Hocko, Roman Gushchin, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Willem de Bruijn, Mina Almasry, Kuniyuki Iwashima, bpf, netdev

On Sat, Sep 20, 2025 at 12:07:17AM +0000, Kuniyuki Iwashima wrote:
> If net.core.memcg_exclusive is 1 when sk->sk_memcg is allocated,
> the socket is flagged with SK_MEMCG_EXCLUSIVE internally and skips
> the global per-protocol memory accounting.
> 
> OTOH, for accept()ed child sockets, this flag is inherited from
> the listening socket in sk_clone_lock() and set in __inet_accept().
> This is to preserve the decision by BPF which will be supported later.
> 
> Given sk->sk_memcg can be accessed in the fast path, it would
> be preferable to place the flag field in the same cache line as
> sk->sk_memcg.
> 
> However, struct sock does not have such a 1-byte hole.
> 
> Let's store the flag in the lowest bit of sk->sk_memcg and check
> it in mem_cgroup_sk_exclusive().
> 
> Tested with a script that creates local socket pairs and send()s a
> bunch of data without recv()ing.
> 
> Setup:
> 
>   # mkdir /sys/fs/cgroup/test
>   # echo $$ >> /sys/fs/cgroup/test/cgroup.procs
>   # sysctl -q net.ipv4.tcp_mem="1000 1000 1000"
> 
> Without net.core.memcg_exclusive, charged to memcg & tcp_mem:
> 
>   # prlimit -n=524288:524288 bash -c "python3 pressure.py" &
>   # cat /sys/fs/cgroup/test/memory.stat | grep sock
>   sock 22642688 <-------------------------------------- charged to memcg
>   # cat /proc/net/sockstat| grep TCP
>   TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 5376 <-- charged to tcp_mem
>   # ss -tn | head -n 5
>   State Recv-Q Send-Q Local Address:Port  Peer Address:Port
>   ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53188
>   ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:49972
>   ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53868
>   ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53554
>   # nstat | grep Pressure || echo no pressure
>   TcpExtTCPMemoryPressures        1                  0.0
> 
> With net.core.memcg_exclusive=1, only charged to memcg:
> 
>   # sysctl -q net.core.memcg_exclusive=1
>   # prlimit -n=524288:524288 bash -c "python3 pressure.py" &
>   # cat /sys/fs/cgroup/test/memory.stat | grep sock
>   sock 2757468160 <------------------------------------ charged to memcg
>   # cat /proc/net/sockstat | grep TCP
>   TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 0 <- NOT charged to tcp_mem
>   # ss -tn | head -n 5
>   State Recv-Q Send-Q  Local Address:Port  Peer Address:Port
>   ESTAB 111000 0           127.0.0.1:36019    127.0.0.1:49026
>   ESTAB 110000 0           127.0.0.1:36019    127.0.0.1:45630
>   ESTAB 110000 0           127.0.0.1:36019    127.0.0.1:44870
>   ESTAB 111000 0           127.0.0.1:36019    127.0.0.1:45274
>   # nstat | grep Pressure || echo no pressure
>   no pressure
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> ---
> v8: Fix build failure when CONFIG_NET=n
> ---
>  Documentation/admin-guide/sysctl/net.rst |  9 ++++++
>  include/net/netns/core.h                 |  3 ++
>  include/net/sock.h                       | 39 ++++++++++++++++++++++--
>  mm/memcontrol.c                          | 12 +++++++-
>  net/core/sock.c                          |  1 +
>  net/core/sysctl_net_core.c               | 11 +++++++
>  net/ipv4/af_inet.c                       |  4 +++
>  7 files changed, 76 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> index 2ef50828aff1..7272194dcf45 100644
> --- a/Documentation/admin-guide/sysctl/net.rst
> +++ b/Documentation/admin-guide/sysctl/net.rst
> @@ -212,6 +212,15 @@ mem_pcpu_rsv
>  
>  Per-cpu reserved forward alloc cache size in page units. Default 1MB per CPU.
>  
> +memcg_exclusive
> +---------------
> +
> +Skip charging socket buffers to the per-protocol global memory accounting
> +(controlled by net.ipv4.tcp_mem, etc) if they are already charged to the
> +cgroup memory controller ("sock" in memory.stat file).
> +
> +Default: 0
> +
>  rmem_default
>  ------------
>  
> diff --git a/include/net/netns/core.h b/include/net/netns/core.h
> index 9b36f0ff0c20..ec511088e67d 100644
> --- a/include/net/netns/core.h
> +++ b/include/net/netns/core.h
> @@ -16,6 +16,9 @@ struct netns_core {
>  	int	sysctl_optmem_max;
>  	u8	sysctl_txrehash;
>  	u8	sysctl_tstamp_allow_data;
> +#ifdef CONFIG_MEMCG
> +	u8	sysctl_memcg_exclusive;
> +#endif

Hmm will this be a system level or namespace level sysctl? Seems like ns
level, any reason to go with netns level?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v10 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl.
  2025-09-20  5:35   ` Shakeel Butt
@ 2025-09-20  6:22     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 16+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-20  6:22 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, John Fastabend, Stanislav Fomichev,
	Johannes Weiner, Michal Hocko, Roman Gushchin, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Willem de Bruijn, Mina Almasry, Kuniyuki Iwashima, bpf, netdev

On Fri, Sep 19, 2025 at 10:36 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Sat, Sep 20, 2025 at 12:07:17AM +0000, Kuniyuki Iwashima wrote:
> > If net.core.memcg_exclusive is 1 when sk->sk_memcg is allocated,
> > the socket is flagged with SK_MEMCG_EXCLUSIVE internally and skips
> > the global per-protocol memory accounting.
> >
> > OTOH, for accept()ed child sockets, this flag is inherited from
> > the listening socket in sk_clone_lock() and set in __inet_accept().
> > This is to preserve the decision by BPF which will be supported later.
> >
> > Given sk->sk_memcg can be accessed in the fast path, it would
> > be preferable to place the flag field in the same cache line as
> > sk->sk_memcg.
> >
> > However, struct sock does not have such a 1-byte hole.
> >
> > Let's store the flag in the lowest bit of sk->sk_memcg and check
> > it in mem_cgroup_sk_exclusive().
> >
> > Tested with a script that creates local socket pairs and send()s a
> > bunch of data without recv()ing.
> >
> > Setup:
> >
> >   # mkdir /sys/fs/cgroup/test
> >   # echo $$ >> /sys/fs/cgroup/test/cgroup.procs
> >   # sysctl -q net.ipv4.tcp_mem="1000 1000 1000"
> >
> > Without net.core.memcg_exclusive, charged to memcg & tcp_mem:
> >
> >   # prlimit -n=524288:524288 bash -c "python3 pressure.py" &
> >   # cat /sys/fs/cgroup/test/memory.stat | grep sock
> >   sock 22642688 <-------------------------------------- charged to memcg
> >   # cat /proc/net/sockstat| grep TCP
> >   TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 5376 <-- charged to tcp_mem
> >   # ss -tn | head -n 5
> >   State Recv-Q Send-Q Local Address:Port  Peer Address:Port
> >   ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53188
> >   ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:49972
> >   ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53868
> >   ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53554
> >   # nstat | grep Pressure || echo no pressure
> >   TcpExtTCPMemoryPressures        1                  0.0
> >
> > With net.core.memcg_exclusive=1, only charged to memcg:
> >
> >   # sysctl -q net.core.memcg_exclusive=1
> >   # prlimit -n=524288:524288 bash -c "python3 pressure.py" &
> >   # cat /sys/fs/cgroup/test/memory.stat | grep sock
> >   sock 2757468160 <------------------------------------ charged to memcg
> >   # cat /proc/net/sockstat | grep TCP
> >   TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 0 <- NOT charged to tcp_mem
> >   # ss -tn | head -n 5
> >   State Recv-Q Send-Q  Local Address:Port  Peer Address:Port
> >   ESTAB 111000 0           127.0.0.1:36019    127.0.0.1:49026
> >   ESTAB 110000 0           127.0.0.1:36019    127.0.0.1:45630
> >   ESTAB 110000 0           127.0.0.1:36019    127.0.0.1:44870
> >   ESTAB 111000 0           127.0.0.1:36019    127.0.0.1:45274
> >   # nstat | grep Pressure || echo no pressure
> >   no pressure
> >
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> > ---
> > v8: Fix build failure when CONFIG_NET=n
> > ---
> >  Documentation/admin-guide/sysctl/net.rst |  9 ++++++
> >  include/net/netns/core.h                 |  3 ++
> >  include/net/sock.h                       | 39 ++++++++++++++++++++++--
> >  mm/memcontrol.c                          | 12 +++++++-
> >  net/core/sock.c                          |  1 +
> >  net/core/sysctl_net_core.c               | 11 +++++++
> >  net/ipv4/af_inet.c                       |  4 +++
> >  7 files changed, 76 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> > index 2ef50828aff1..7272194dcf45 100644
> > --- a/Documentation/admin-guide/sysctl/net.rst
> > +++ b/Documentation/admin-guide/sysctl/net.rst
> > @@ -212,6 +212,15 @@ mem_pcpu_rsv
> >
> >  Per-cpu reserved forward alloc cache size in page units. Default 1MB per CPU.
> >
> > +memcg_exclusive
> > +---------------
> > +
> > +Skip charging socket buffers to the per-protocol global memory accounting
> > +(controlled by net.ipv4.tcp_mem, etc) if they are already charged to the
> > +cgroup memory controller ("sock" in memory.stat file).
> > +
> > +Default: 0
> > +
> >  rmem_default
> >  ------------
> >
> > diff --git a/include/net/netns/core.h b/include/net/netns/core.h
> > index 9b36f0ff0c20..ec511088e67d 100644
> > --- a/include/net/netns/core.h
> > +++ b/include/net/netns/core.h
> > @@ -16,6 +16,9 @@ struct netns_core {
> >       int     sysctl_optmem_max;
> >       u8      sysctl_txrehash;
> >       u8      sysctl_tstamp_allow_data;
> > +#ifdef CONFIG_MEMCG
> > +     u8      sysctl_memcg_exclusive;
> > +#endif
>
> Hmm will this be a system level or namespace level sysctl? Seems like ns
> level, any reason to go with netns level?

This is netns-level, I don't have a specific reason other than
granularity and have no preference.  Anyway, either
system-level or netns-level require admin privilege.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v10 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl.
  2025-09-20  0:07 ` [PATCH v10 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl Kuniyuki Iwashima
  2025-09-20  5:35   ` Shakeel Butt
@ 2025-09-23  0:54   ` Shakeel Butt
  2025-09-23  1:03     ` Kuniyuki Iwashima
  2025-09-25 22:52   ` Shakeel Butt
  2 siblings, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2025-09-23  0:54 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, John Fastabend, Stanislav Fomichev,
	Johannes Weiner, Michal Hocko, Roman Gushchin, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Willem de Bruijn, Mina Almasry, Kuniyuki Iwashima, bpf, netdev

On Sat, Sep 20, 2025 at 12:07:17AM +0000, Kuniyuki Iwashima wrote:
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 814966309b0e..348e599c3fbc 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2519,6 +2519,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
>  #ifdef CONFIG_MEMCG
>  	/* sk->sk_memcg will be populated at accept() time */
>  	newsk->sk_memcg = NULL;
> +	mem_cgroup_sk_set_flags(newsk, mem_cgroup_sk_get_flags(sk));

Why do you need to set the flag here? Will doing in __inet_accept only
be too late i.e. protocol accounting would have happened?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v10 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl.
  2025-09-23  0:54   ` Shakeel Butt
@ 2025-09-23  1:03     ` Kuniyuki Iwashima
  2025-09-23  1:26       ` Kuniyuki Iwashima
  0 siblings, 1 reply; 16+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-23  1:03 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, John Fastabend, Stanislav Fomichev,
	Johannes Weiner, Michal Hocko, Roman Gushchin, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Willem de Bruijn, Mina Almasry, Kuniyuki Iwashima, bpf, netdev

On Mon, Sep 22, 2025 at 5:54 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Sat, Sep 20, 2025 at 12:07:17AM +0000, Kuniyuki Iwashima wrote:
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index 814966309b0e..348e599c3fbc 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -2519,6 +2519,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
> >  #ifdef CONFIG_MEMCG
> >       /* sk->sk_memcg will be populated at accept() time */
> >       newsk->sk_memcg = NULL;
> > +     mem_cgroup_sk_set_flags(newsk, mem_cgroup_sk_get_flags(sk));
>
> Why do you need to set the flag here? Will doing in __inet_accept only
> be too late i.e. protocol accounting would have happened?

Currently, we only allow bpf_setsockopt() during socket(2) not
to make things complicated as explained in patch 4.

So, this is to preserve the listener's flag set by bpf_setsockopt()
since network applications basically assume setsockopt() for a
listener socket is inherited to its child sockets.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v10 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl.
  2025-09-23  1:03     ` Kuniyuki Iwashima
@ 2025-09-23  1:26       ` Kuniyuki Iwashima
  0 siblings, 0 replies; 16+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-23  1:26 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, John Fastabend, Stanislav Fomichev,
	Johannes Weiner, Michal Hocko, Roman Gushchin, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Willem de Bruijn, Mina Almasry, Kuniyuki Iwashima, bpf, netdev

On Mon, Sep 22, 2025 at 6:03 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> On Mon, Sep 22, 2025 at 5:54 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Sat, Sep 20, 2025 at 12:07:17AM +0000, Kuniyuki Iwashima wrote:
> > > diff --git a/net/core/sock.c b/net/core/sock.c
> > > index 814966309b0e..348e599c3fbc 100644
> > > --- a/net/core/sock.c
> > > +++ b/net/core/sock.c
> > > @@ -2519,6 +2519,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
> > >  #ifdef CONFIG_MEMCG
> > >       /* sk->sk_memcg will be populated at accept() time */
> > >       newsk->sk_memcg = NULL;
> > > +     mem_cgroup_sk_set_flags(newsk, mem_cgroup_sk_get_flags(sk));
> >
> > Why do you need to set the flag here? Will doing in __inet_accept only
> > be too late i.e. protocol accounting would have happened?

Oh no, this is because I used newsk in __inet_accept(), and
we can use sock->sk in __inet_accept() and remove the inheritance
in sk_clone_lock() like this,

flags = mem_cgroup_sk_get_flags(sock->sk);

but I think using newsk reduces one cache miss.  Also, we touch the
parent sk_memcg during sock_copy() in sk_clone_lock().

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v10 bpf-next/net 2/6] net-memcg: Allow decoupling memcg from global protocol memory accounting.
  2025-09-20  5:25   ` Shakeel Butt
@ 2025-09-25 22:40     ` Kuniyuki Iwashima
  2025-09-27 14:49       ` Martin KaFai Lau
  0 siblings, 1 reply; 16+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-25 22:40 UTC (permalink / raw)
  To: Johannes Weiner, Shakeel Butt
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, John Fastabend, Stanislav Fomichev,
	Michal Hocko, Roman Gushchin, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, bpf, netdev

On Fri, Sep 19, 2025 at 10:25 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Sat, Sep 20, 2025 at 12:07:16AM +0000, Kuniyuki Iwashima wrote:
> > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > buffers and charge memory to per-protocol global counters pointed to by
> > sk->sk_proto->memory_allocated.
> >
> > If a socket has sk->sk_memcg, this memory is also charged to memcg as
> > "sock" in memory.stat.
> >
> > We do not need to pay costs for two orthogonal memory accounting
> > mechanisms.  A microbenchmark result is in the subsequent bpf patch.
> >
> > Let's decouple sockets under memcg from the global per-protocol memory
> > accounting if mem_cgroup_sk_exclusive() returns true.
> >
> > Note that this does NOT disable memcg, but rather the per-protocol one.
> >
> > mem_cgroup_sk_exclusive() starts to return true in the following patches,
> > and then, the per-protocol memory accounting will be skipped.
> >
> > In __inet_accept(), we need to reclaim counts that are already charged
> > for child sockets because we do not allocate sk->sk_memcg until accept().
> >
> > trace_sock_exceed_buf_limit() will always show 0 as accounted for the
> > memcg-exclusive sockets, but this can be obtained in memory.stat.
> >
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> > Nacked-by: Johannes Weiner <hannes@cmpxchg.org>
>
> This looks good to me now, let's ask Johannes to take a look again and if
> he still has any concerns.
>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>

Hi Johannes,

It would be really appreciated if you have a look again.

Thank you.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v10 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl.
  2025-09-20  0:07 ` [PATCH v10 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl Kuniyuki Iwashima
  2025-09-20  5:35   ` Shakeel Butt
  2025-09-23  0:54   ` Shakeel Butt
@ 2025-09-25 22:52   ` Shakeel Butt
  2 siblings, 0 replies; 16+ messages in thread
From: Shakeel Butt @ 2025-09-25 22:52 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, John Fastabend, Stanislav Fomichev,
	Johannes Weiner, Michal Hocko, Roman Gushchin, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Willem de Bruijn, Mina Almasry, Kuniyuki Iwashima, bpf, netdev

On Sat, Sep 20, 2025 at 12:07:17AM +0000, Kuniyuki Iwashima wrote:
> If net.core.memcg_exclusive is 1 when sk->sk_memcg is allocated,
> the socket is flagged with SK_MEMCG_EXCLUSIVE internally and skips
> the global per-protocol memory accounting.
> 
> OTOH, for accept()ed child sockets, this flag is inherited from
> the listening socket in sk_clone_lock() and set in __inet_accept().
> This is to preserve the decision by BPF which will be supported later.
> 
> Given sk->sk_memcg can be accessed in the fast path, it would
> be preferable to place the flag field in the same cache line as
> sk->sk_memcg.
> 
> However, struct sock does not have such a 1-byte hole.
> 
> Let's store the flag in the lowest bit of sk->sk_memcg and check
> it in mem_cgroup_sk_exclusive().
> 
> Tested with a script that creates local socket pairs and send()s a
> bunch of data without recv()ing.
> 
> Setup:
> 
>   # mkdir /sys/fs/cgroup/test
>   # echo $$ >> /sys/fs/cgroup/test/cgroup.procs
>   # sysctl -q net.ipv4.tcp_mem="1000 1000 1000"
> 
> Without net.core.memcg_exclusive, charged to memcg & tcp_mem:
> 
>   # prlimit -n=524288:524288 bash -c "python3 pressure.py" &
>   # cat /sys/fs/cgroup/test/memory.stat | grep sock
>   sock 22642688 <-------------------------------------- charged to memcg
>   # cat /proc/net/sockstat| grep TCP
>   TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 5376 <-- charged to tcp_mem
>   # ss -tn | head -n 5
>   State Recv-Q Send-Q Local Address:Port  Peer Address:Port
>   ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53188
>   ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:49972
>   ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53868
>   ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53554
>   # nstat | grep Pressure || echo no pressure
>   TcpExtTCPMemoryPressures        1                  0.0
> 
> With net.core.memcg_exclusive=1, only charged to memcg:
> 
>   # sysctl -q net.core.memcg_exclusive=1
>   # prlimit -n=524288:524288 bash -c "python3 pressure.py" &
>   # cat /sys/fs/cgroup/test/memory.stat | grep sock
>   sock 2757468160 <------------------------------------ charged to memcg
>   # cat /proc/net/sockstat | grep TCP
>   TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 0 <- NOT charged to tcp_mem
>   # ss -tn | head -n 5
>   State Recv-Q Send-Q  Local Address:Port  Peer Address:Port
>   ESTAB 111000 0           127.0.0.1:36019    127.0.0.1:49026
>   ESTAB 110000 0           127.0.0.1:36019    127.0.0.1:45630
>   ESTAB 110000 0           127.0.0.1:36019    127.0.0.1:44870
>   ESTAB 111000 0           127.0.0.1:36019    127.0.0.1:45274
>   # nstat | grep Pressure || echo no pressure
>   no pressure
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v10 bpf-next/net 2/6] net-memcg: Allow decoupling memcg from global protocol memory accounting.
  2025-09-25 22:40     ` Kuniyuki Iwashima
@ 2025-09-27 14:49       ` Martin KaFai Lau
  0 siblings, 0 replies; 16+ messages in thread
From: Martin KaFai Lau @ 2025-09-27 14:49 UTC (permalink / raw)
  To: Kuniyuki Iwashima, Johannes Weiner, Shakeel Butt
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	John Fastabend, Stanislav Fomichev, Michal Hocko, Roman Gushchin,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Neal Cardwell, Willem de Bruijn, Mina Almasry, Kuniyuki Iwashima,
	bpf, netdev

On 9/25/25 3:40 PM, Kuniyuki Iwashima wrote:
> On Fri, Sep 19, 2025 at 10:25 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>
>> On Sat, Sep 20, 2025 at 12:07:16AM +0000, Kuniyuki Iwashima wrote:
>>> Some protocols (e.g., TCP, UDP) implement memory accounting for socket
>>> buffers and charge memory to per-protocol global counters pointed to by
>>> sk->sk_proto->memory_allocated.
>>>
>>> If a socket has sk->sk_memcg, this memory is also charged to memcg as
>>> "sock" in memory.stat.
>>>
>>> We do not need to pay costs for two orthogonal memory accounting
>>> mechanisms.  A microbenchmark result is in the subsequent bpf patch.
>>>
>>> Let's decouple sockets under memcg from the global per-protocol memory
>>> accounting if mem_cgroup_sk_exclusive() returns true.
>>>
>>> Note that this does NOT disable memcg, but rather the per-protocol one.
>>>
>>> mem_cgroup_sk_exclusive() starts to return true in the following patches,
>>> and then, the per-protocol memory accounting will be skipped.
>>>
>>> In __inet_accept(), we need to reclaim counts that are already charged
>>> for child sockets because we do not allocate sk->sk_memcg until accept().
>>>
>>> trace_sock_exceed_buf_limit() will always show 0 as accounted for the
>>> memcg-exclusive sockets, but this can be obtained in memory.stat.
>>>
>>> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
>>> Nacked-by: Johannes Weiner <hannes@cmpxchg.org>

The patch cannot move on (i.e. land or respin) until the NACK concern is resolved.

>>
>> This looks good to me now, let's ask Johannes to take a look again and if
>> he still has any concerns.
>>
>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> 
> Hi Johannes,
> 
> It would be really appreciated if you have a look again.
> 

Hi Johannes,

Can you take a look if the clarified reason in the commit message has addressed 
the concern ?

Thanks,
Martin

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-09-27 14:50 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-20  0:07 [PATCH v10 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
2025-09-20  0:07 ` [PATCH v10 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept() Kuniyuki Iwashima
2025-09-20  0:07 ` [PATCH v10 bpf-next/net 2/6] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
2025-09-20  5:25   ` Shakeel Butt
2025-09-25 22:40     ` Kuniyuki Iwashima
2025-09-27 14:49       ` Martin KaFai Lau
2025-09-20  0:07 ` [PATCH v10 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl Kuniyuki Iwashima
2025-09-20  5:35   ` Shakeel Butt
2025-09-20  6:22     ` Kuniyuki Iwashima
2025-09-23  0:54   ` Shakeel Butt
2025-09-23  1:03     ` Kuniyuki Iwashima
2025-09-23  1:26       ` Kuniyuki Iwashima
2025-09-25 22:52   ` Shakeel Butt
2025-09-20  0:07 ` [PATCH v10 bpf-next/net 4/6] bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE Kuniyuki Iwashima
2025-09-20  0:07 ` [PATCH v10 bpf-next/net 5/6] bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_EXCLUSIVE Kuniyuki Iwashima
2025-09-20  0:07 ` [PATCH v10 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE Kuniyuki Iwashima

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).