netdev.vger.kernel.org archive mirror
* [PATCH v9 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated.
@ 2025-09-17 19:13 Kuniyuki Iwashima
  2025-09-17 19:13 ` [PATCH v9 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept() Kuniyuki Iwashima
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-17 19:13 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

Some protocols (e.g., TCP, UDP) have their own memory accounting for
socket buffers and charge memory to global per-protocol counters such
as /proc/sys/net/ipv4/tcp_mem.

If the socket has sk->sk_memcg, this memory is also charged to the memcg
as "sock" in memory.stat.

We do not need to pay the cost of two orthogonal memory accounting
mechanisms.

This series allows decoupling memcg from the global memory accounting
(memcg + tcp_mem -> memcg) if the socket is configured as such by
sysctl or a BPF prog.
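
For illustration only (details are in patches 3 and 5), the decoupling
can be enabled either via the sysctl:

  # sysctl -q net.core.memcg_exclusive=1

or per cgroup, from a BPF prog attached to the cgroup/sock_create hook,
where ctx is the socket being created:

  u32 flags = SK_BPF_MEMCG_EXCLUSIVE;

  bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
                 &flags, sizeof(flags));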


Overview of the series:

  patches 1 & 2 prepare for decoupling memcg from sk_prot->memory_allocated
    based on the SK_MEMCG_EXCLUSIVE flag
  patch 3 introduces net.core.memcg_exclusive
  patches 4 & 5 support flagging SK_MEMCG_EXCLUSIVE via bpf_setsockopt()
  patch 6 adds a selftest


Changes:
  v9:
    * Patch 1: Drop sk_is_mptcp() as sk_is_tcp() is always true for MPTCP
               subflows and the check is unnecessary for SCTP (kernel-test-robot)

  v8:
    * Patch 3: Fix build failure when CONFIG_NET=n (kernel-test-robot)

  v7: https://lore.kernel.org/netdev/20250909204632.3994767-1-kuniyu@google.com/
    * Rename s/ISOLATED/EXCLUSIVE/ (Shakeel)
    * Add patch 3 (net.core.memcg_exclusive sysctl) (Shakeel)
    * Reorder the core patch 2 before sysctl + bpf changes
    * Patch 6
      * Add test for sysctl

  v6: https://lore.kernel.org/netdev/20250908223750.3375376-1-kuniyu@google.com/
    * Patch 4
      * Update commit message (Shakeel)
    * Patch 5
      * Trace sk_prot->memory_allocated + sk_prot->memory_per_cpu_fw_alloc (Martin)

  v5: https://lore.kernel.org/netdev/20250903190238.2511885-1-kuniyu@google.com/
    * Patch 2
      * Rename new variants to bpf_sock_create_{get,set}sockopt() (Martin)
    * Patch 3
      * Limit getsockopt() to BPF_CGROUP_INET_SOCK_CREATE (Martin)
    * Patch 5
      * Use kern_sync_rcu() (Martin)
      * Double NR_SEND to 128

  v4: https://lore.kernel.org/netdev/20250829010026.347440-1-kuniyu@google.com/
    * Patch 2
      * Use __bpf_setsockopt() instead of _bpf_setsockopt() (Martin)
      * Add getsockopt() for a cgroup with multiple bpf progs running (Martin)
    * Patch 3
      * Only allow inet_create() to set flags (Martin)
      * Inherit flags from listener to child in sk_clone_lock() (Martin)
      * Support clearing flags (Martin)
    * Patch 5
      * Only use inet_create() hook
      * Test bpf_getsockopt()
      * Add serial_ prefix
      * Reduce sleep() and the amount of sent data

  v3: https://lore.kernel.org/netdev/20250826183940.3310118-1-kuniyu@google.com/
    * Drop patches for accept() hook
    * Patch 1
      * Merge if blocks
    * Patch2
      * Drop bpf_func_proto for accept()
    * Patch 3
      * Allow flagging without sk->sk_memcg
      * Inherit SK_BPF_MEMCG_SOCK_ISOLATED in __inet_accept()

  v2: https://lore.kernel.org/bpf/20250825204158.2414402-1-kuniyu@google.com/
    * Patch 2
      * Define BPF_CGROUP_RUN_PROG_INET_SOCK_ACCEPT() when CONFIG_CGROUP_BPF=n
    * Patch 5
      * Make 2 new bpf_func_proto static
    * Patch 6
      * s/mem_cgroup_sk_set_flag/mem_cgroup_sk_set_flags/ when CONFIG_MEMCG=n
      * Use finer CONFIG_CGROUP_BPF instead of CONFIG_BPF_SYSCALL for ifdef

  v1: https://lore.kernel.org/netdev/20250822221846.744252-1-kuniyu@google.com/


Kuniyuki Iwashima (6):
  tcp: Save lock_sock() for memcg in inet_csk_accept().
  net-memcg: Allow decoupling memcg from global protocol memory
    accounting.
  net-memcg: Introduce net.core.memcg_exclusive sysctl.
  bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE.
  bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_EXCLUSIVE.
  selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE.

 Documentation/admin-guide/sysctl/net.rst      |   9 +
 include/net/netns/core.h                      |   3 +
 include/net/proto_memory.h                    |  15 +-
 include/net/sock.h                            |  47 +++-
 include/net/tcp.h                             |  10 +-
 include/uapi/linux/bpf.h                      |   6 +
 mm/memcontrol.c                               |  15 +-
 net/core/filter.c                             |  82 ++++++
 net/core/sock.c                               |  65 +++--
 net/core/sysctl_net_core.c                    |  11 +
 net/ipv4/af_inet.c                            |  36 +++
 net/ipv4/inet_connection_sock.c               |  26 +-
 net/ipv4/tcp.c                                |   3 +-
 net/ipv4/tcp_output.c                         |  10 +-
 net/mptcp/protocol.c                          |   3 +-
 net/tls/tls_device.c                          |   4 +-
 tools/include/uapi/linux/bpf.h                |   6 +
 .../selftests/bpf/prog_tests/sk_memcg.c       | 261 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/sk_memcg.c  | 146 ++++++++++
 19 files changed, 700 insertions(+), 58 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_memcg.c
 create mode 100644 tools/testing/selftests/bpf/progs/sk_memcg.c

-- 
2.51.0.384.g4c02a37b29-goog


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v9 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept().
  2025-09-17 19:13 [PATCH v9 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
@ 2025-09-17 19:13 ` Kuniyuki Iwashima
  2025-09-17 19:13 ` [PATCH v9 bpf-next/net 2/6] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-17 19:13 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

If memcg is enabled, accept() acquires lock_sock() twice for each new
TCP/MPTCP socket in inet_csk_accept() and __inet_accept().

Let's move memcg operations from inet_csk_accept() to __inet_accept().

Note that SCTP somehow allocates a new socket by sk_alloc() in
sk->sk_prot->accept() and clones fields manually, instead of using
sk_clone_lock().

mem_cgroup_sk_alloc() is called for SCTP before __inet_accept(),
so I added the protocol check in __inet_accept(), but this can be
removed once SCTP uses sk_clone_lock().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
---
v9: Drop sk_is_mptcp() check as sk_is_tcp() is true for MPTCP subflow
v3: Don't split if blocks
---
 net/ipv4/af_inet.c              | 22 ++++++++++++++++++++++
 net/ipv4/inet_connection_sock.c | 25 -------------------------
 2 files changed, 22 insertions(+), 25 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 76e38092cd8a..c99724b3db04 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -753,6 +753,28 @@ EXPORT_SYMBOL(inet_stream_connect);
 
 void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *newsk)
 {
+	/* TODO: use sk_clone_lock() in SCTP and remove protocol checks */
+	if (mem_cgroup_sockets_enabled &&
+	    (!IS_ENABLED(CONFIG_IP_SCTP) || sk_is_tcp(newsk))) {
+		gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
+
+		mem_cgroup_sk_alloc(newsk);
+
+		if (mem_cgroup_from_sk(newsk)) {
+			int amt;
+
+			/* The socket has not been accepted yet, no need
+			 * to look at newsk->sk_wmem_queued.
+			 */
+			amt = sk_mem_pages(newsk->sk_forward_alloc +
+					   atomic_read(&newsk->sk_rmem_alloc));
+			if (amt)
+				mem_cgroup_sk_charge(newsk, amt, gfp);
+		}
+
+		kmem_cache_charge(newsk, gfp);
+	}
+
 	sock_rps_record_flow(newsk);
 	WARN_ON(!((1 << newsk->sk_state) &
 		  (TCPF_ESTABLISHED | TCPF_SYN_RECV |
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 0ef1eacd539d..ed10b959a906 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -708,31 +708,6 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
 
 	release_sock(sk);
 
-	if (mem_cgroup_sockets_enabled) {
-		gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
-		int amt = 0;
-
-		/* atomically get the memory usage, set and charge the
-		 * newsk->sk_memcg.
-		 */
-		lock_sock(newsk);
-
-		mem_cgroup_sk_alloc(newsk);
-		if (mem_cgroup_from_sk(newsk)) {
-			/* The socket has not been accepted yet, no need
-			 * to look at newsk->sk_wmem_queued.
-			 */
-			amt = sk_mem_pages(newsk->sk_forward_alloc +
-					   atomic_read(&newsk->sk_rmem_alloc));
-		}
-
-		if (amt)
-			mem_cgroup_sk_charge(newsk, amt, gfp);
-		kmem_cache_charge(newsk, gfp);
-
-		release_sock(newsk);
-	}
-
 	if (req)
 		reqsk_put(req);
 
-- 
2.51.0.384.g4c02a37b29-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v9 bpf-next/net 2/6] net-memcg: Allow decoupling memcg from global protocol memory accounting.
  2025-09-17 19:13 [PATCH v9 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
  2025-09-17 19:13 ` [PATCH v9 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept() Kuniyuki Iwashima
@ 2025-09-17 19:13 ` Kuniyuki Iwashima
  2025-09-17 19:13 ` [PATCH v9 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl Kuniyuki Iwashima
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-17 19:13 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

Some protocols (e.g., TCP, UDP) implement memory accounting for socket
buffers and charge memory to per-protocol global counters pointed to by
sk->sk_prot->memory_allocated.

If a socket has sk->sk_memcg, this memory is also charged to memcg as
"sock" in memory.stat.

We do not need to pay the cost of two orthogonal memory accounting
mechanisms.  A microbenchmark result is in the subsequent bpf patch.

Let's decouple sockets under memcg from the global per-protocol memory
accounting if mem_cgroup_sk_exclusive() returns true.

Note that this does NOT disable the memcg accounting, but rather
the per-protocol one.

mem_cgroup_sk_exclusive() starts to return true in the following
patches, and the per-protocol memory accounting will then be skipped.

In __inet_accept(), we need to reclaim counts that are already charged
for child sockets because we do not allocate sk->sk_memcg until accept().

trace_sock_exceed_buf_limit() will always show 0 as the accounted
amount for memcg-exclusive sockets, but the value can be obtained
from memory.stat.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Nacked-by: Johannes Weiner <hannes@cmpxchg.org>
---
v7: Reorder before sysctl & bpf patches
v6: Update commit message
---
 include/net/proto_memory.h      | 15 ++++++--
 include/net/sock.h              | 10 ++++++
 include/net/tcp.h               | 10 ++++--
 net/core/sock.c                 | 64 ++++++++++++++++++++++-----------
 net/ipv4/af_inet.c              | 12 ++++++-
 net/ipv4/inet_connection_sock.c |  1 +
 net/ipv4/tcp.c                  |  3 +-
 net/ipv4/tcp_output.c           | 10 ++++--
 net/mptcp/protocol.c            |  3 +-
 net/tls/tls_device.c            |  4 ++-
 10 files changed, 100 insertions(+), 32 deletions(-)

diff --git a/include/net/proto_memory.h b/include/net/proto_memory.h
index 72d4ec413ab5..4383cb4cb2d2 100644
--- a/include/net/proto_memory.h
+++ b/include/net/proto_memory.h
@@ -31,13 +31,22 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
 	if (!sk->sk_prot->memory_pressure)
 		return false;
 
-	if (mem_cgroup_sk_enabled(sk) &&
-	    mem_cgroup_sk_under_memory_pressure(sk))
-		return true;
+	if (mem_cgroup_sk_enabled(sk)) {
+		if (mem_cgroup_sk_under_memory_pressure(sk))
+			return true;
+
+		if (mem_cgroup_sk_exclusive(sk))
+			return false;
+	}
 
 	return !!READ_ONCE(*sk->sk_prot->memory_pressure);
 }
 
+static inline bool sk_should_enter_memory_pressure(struct sock *sk)
+{
+	return !mem_cgroup_sk_enabled(sk) || !mem_cgroup_sk_exclusive(sk);
+}
+
 static inline long
 proto_memory_allocated(const struct proto *prot)
 {
diff --git a/include/net/sock.h b/include/net/sock.h
index 63a6a48afb48..66501ab670eb 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2607,6 +2607,11 @@ static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
 	return mem_cgroup_sockets_enabled && mem_cgroup_from_sk(sk);
 }
 
+static inline bool mem_cgroup_sk_exclusive(const struct sock *sk)
+{
+	return false;
+}
+
 static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_sk(sk);
@@ -2634,6 +2639,11 @@ static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
 	return false;
 }
 
+static inline bool mem_cgroup_sk_exclusive(const struct sock *sk)
+{
+	return false;
+}
+
 static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
 {
 	return false;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2936b8175950..225f6bac06c3 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -275,9 +275,13 @@ extern unsigned long tcp_memory_pressure;
 /* optimized version of sk_under_memory_pressure() for TCP sockets */
 static inline bool tcp_under_memory_pressure(const struct sock *sk)
 {
-	if (mem_cgroup_sk_enabled(sk) &&
-	    mem_cgroup_sk_under_memory_pressure(sk))
-		return true;
+	if (mem_cgroup_sk_enabled(sk)) {
+		if (mem_cgroup_sk_under_memory_pressure(sk))
+			return true;
+
+		if (mem_cgroup_sk_exclusive(sk))
+			return false;
+	}
 
 	return READ_ONCE(tcp_memory_pressure);
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index 8002ac6293dc..814966309b0e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1046,17 +1046,21 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
 	if (!charged)
 		return -ENOMEM;
 
-	/* pre-charge to forward_alloc */
-	sk_memory_allocated_add(sk, pages);
-	allocated = sk_memory_allocated(sk);
-	/* If the system goes into memory pressure with this
-	 * precharge, give up and return error.
-	 */
-	if (allocated > sk_prot_mem_limits(sk, 1)) {
-		sk_memory_allocated_sub(sk, pages);
-		mem_cgroup_sk_uncharge(sk, pages);
-		return -ENOMEM;
+	if (!mem_cgroup_sk_exclusive(sk)) {
+		/* pre-charge to forward_alloc */
+		sk_memory_allocated_add(sk, pages);
+		allocated = sk_memory_allocated(sk);
+
+		/* If the system goes into memory pressure with this
+		 * precharge, give up and return error.
+		 */
+		if (allocated > sk_prot_mem_limits(sk, 1)) {
+			sk_memory_allocated_sub(sk, pages);
+			mem_cgroup_sk_uncharge(sk, pages);
+			return -ENOMEM;
+		}
 	}
+
 	sk_forward_alloc_add(sk, pages << PAGE_SHIFT);
 
 	WRITE_ONCE(sk->sk_reserved_mem,
@@ -3153,8 +3157,11 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 	if (likely(skb_page_frag_refill(32U, pfrag, sk->sk_allocation)))
 		return true;
 
-	sk_enter_memory_pressure(sk);
+	if (sk_should_enter_memory_pressure(sk))
+		sk_enter_memory_pressure(sk);
+
 	sk_stream_moderate_sndbuf(sk);
+
 	return false;
 }
 EXPORT_SYMBOL(sk_page_frag_refill);
@@ -3267,18 +3274,30 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
 	bool memcg_enabled = false, charged = false;
 	struct proto *prot = sk->sk_prot;
-	long allocated;
-
-	sk_memory_allocated_add(sk, amt);
-	allocated = sk_memory_allocated(sk);
+	long allocated = 0;
 
 	if (mem_cgroup_sk_enabled(sk)) {
+		bool exclusive = mem_cgroup_sk_exclusive(sk);
+
 		memcg_enabled = true;
 		charged = mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge());
-		if (!charged)
+
+		if (exclusive && charged)
+			return 1;
+
+		if (!charged) {
+			if (!exclusive) {
+				sk_memory_allocated_add(sk, amt);
+				allocated = sk_memory_allocated(sk);
+			}
+
 			goto suppress_allocation;
+		}
 	}
 
+	sk_memory_allocated_add(sk, amt);
+	allocated = sk_memory_allocated(sk);
+
 	/* Under limit. */
 	if (allocated <= sk_prot_mem_limits(sk, 0)) {
 		sk_leave_memory_pressure(sk);
@@ -3357,7 +3376,8 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 
 	trace_sock_exceed_buf_limit(sk, prot, allocated, kind);
 
-	sk_memory_allocated_sub(sk, amt);
+	if (allocated)
+		sk_memory_allocated_sub(sk, amt);
 
 	if (charged)
 		mem_cgroup_sk_uncharge(sk, amt);
@@ -3396,11 +3416,15 @@ EXPORT_SYMBOL(__sk_mem_schedule);
  */
 void __sk_mem_reduce_allocated(struct sock *sk, int amount)
 {
-	sk_memory_allocated_sub(sk, amount);
-
-	if (mem_cgroup_sk_enabled(sk))
+	if (mem_cgroup_sk_enabled(sk)) {
 		mem_cgroup_sk_uncharge(sk, amount);
 
+		if (mem_cgroup_sk_exclusive(sk))
+			return;
+	}
+
+	sk_memory_allocated_sub(sk, amount);
+
 	if (sk_under_global_memory_pressure(sk) &&
 	    (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
 		sk_leave_memory_pressure(sk);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index c99724b3db04..c410eb525ebe 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -95,6 +95,7 @@
 #include <net/checksum.h>
 #include <net/ip.h>
 #include <net/protocol.h>
+#include <net/proto_memory.h>
 #include <net/arp.h>
 #include <net/route.h>
 #include <net/ip_fib.h>
@@ -768,8 +769,17 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
 			 */
 			amt = sk_mem_pages(newsk->sk_forward_alloc +
 					   atomic_read(&newsk->sk_rmem_alloc));
-			if (amt)
+			if (amt) {
+				/* This amt is already charged globally to
+				 * sk_prot->memory_allocated due to lack of
+				 * sk_memcg until accept(), thus we need to
+				 * reclaim it here if newsk is exclusive.
+				 */
+				if (mem_cgroup_sk_exclusive(newsk))
+					sk_memory_allocated_sub(newsk, amt);
+
 				mem_cgroup_sk_charge(newsk, amt, gfp);
+			}
 		}
 
 		kmem_cache_charge(newsk, gfp);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index ed10b959a906..f8dd53d40dcf 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -22,6 +22,7 @@
 #include <net/tcp.h>
 #include <net/sock_reuseport.h>
 #include <net/addrconf.h>
+#include <net/proto_memory.h>
 
 #if IS_ENABLED(CONFIG_IPV6)
 /* match_sk*_wildcard == true:  IPV6_ADDR_ANY equals to any IPv6 addresses
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 71a956fbfc55..dcbd49e2f8af 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -908,7 +908,8 @@ struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
 		}
 		__kfree_skb(skb);
 	} else {
-		sk->sk_prot->enter_memory_pressure(sk);
+		if (sk_should_enter_memory_pressure(sk))
+			tcp_enter_memory_pressure(sk);
 		sk_stream_moderate_sndbuf(sk);
 	}
 	return NULL;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index dfbac0876d96..4b6a7250a9c2 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3574,12 +3574,18 @@ void sk_forced_mem_schedule(struct sock *sk, int size)
 	delta = size - sk->sk_forward_alloc;
 	if (delta <= 0)
 		return;
+
 	amt = sk_mem_pages(delta);
 	sk_forward_alloc_add(sk, amt << PAGE_SHIFT);
-	sk_memory_allocated_add(sk, amt);
 
-	if (mem_cgroup_sk_enabled(sk))
+	if (mem_cgroup_sk_enabled(sk)) {
 		mem_cgroup_sk_charge(sk, amt, gfp_memcg_charge() | __GFP_NOFAIL);
+
+		if (mem_cgroup_sk_exclusive(sk))
+			return;
+	}
+
+	sk_memory_allocated_add(sk, amt);
 }
 
 /* Send a FIN. The caller locks the socket for us.
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 9a287b75c1b3..f7487e22a3f8 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -16,6 +16,7 @@
 #include <net/inet_common.h>
 #include <net/inet_hashtables.h>
 #include <net/protocol.h>
+#include <net/proto_memory.h>
 #include <net/tcp_states.h>
 #if IS_ENABLED(CONFIG_MPTCP_IPV6)
 #include <net/transp_v6.h>
@@ -1016,7 +1017,7 @@ static void mptcp_enter_memory_pressure(struct sock *sk)
 	mptcp_for_each_subflow(msk, subflow) {
 		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
 
-		if (first)
+		if (first && sk_should_enter_memory_pressure(ssk))
 			tcp_enter_memory_pressure(ssk);
 		sk_stream_moderate_sndbuf(ssk);
 
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index f672a62a9a52..6696ef837116 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -35,6 +35,7 @@
 #include <linux/netdevice.h>
 #include <net/dst.h>
 #include <net/inet_connection_sock.h>
+#include <net/proto_memory.h>
 #include <net/tcp.h>
 #include <net/tls.h>
 #include <linux/skbuff_ref.h>
@@ -371,7 +372,8 @@ static int tls_do_allocation(struct sock *sk,
 	if (!offload_ctx->open_record) {
 		if (unlikely(!skb_page_frag_refill(prepend_size, pfrag,
 						   sk->sk_allocation))) {
-			READ_ONCE(sk->sk_prot)->enter_memory_pressure(sk);
+			if (sk_should_enter_memory_pressure(sk))
+				READ_ONCE(sk->sk_prot)->enter_memory_pressure(sk);
 			sk_stream_moderate_sndbuf(sk);
 			return -ENOMEM;
 		}
-- 
2.51.0.384.g4c02a37b29-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v9 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl.
  2025-09-17 19:13 [PATCH v9 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
  2025-09-17 19:13 ` [PATCH v9 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept() Kuniyuki Iwashima
  2025-09-17 19:13 ` [PATCH v9 bpf-next/net 2/6] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
@ 2025-09-17 19:13 ` Kuniyuki Iwashima
  2025-09-17 19:14 ` [PATCH v9 bpf-next/net 4/6] bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE Kuniyuki Iwashima
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-17 19:13 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

If net.core.memcg_exclusive is 1 when sk->sk_memcg is allocated,
the socket is flagged with SK_MEMCG_EXCLUSIVE internally and skips
the global per-protocol memory accounting.

OTOH, for accept()ed child sockets, this flag is inherited from
the listening socket in sk_clone_lock() and set in __inet_accept().
This is to preserve the decision made by BPF, which will be
supported later.

Given sk->sk_memcg can be accessed in the fast path, it would
be preferable to place the flag field in the same cache line as
sk->sk_memcg.

However, struct sock does not have such a 1-byte hole.

Let's store the flag in the lowest bit of sk->sk_memcg, which is
always zero thanks to pointer alignment, and check it in
mem_cgroup_sk_exclusive().

Tested with a script that creates local socket pairs and send()s a
bunch of data without recv()ing.
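
pressure.py itself is not included in the series; a rough C sketch of
the same idea (error handling omitted, sizes arbitrary) would be:

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <sys/socket.h>
  #include <unistd.h>

  #define NR_PAIRS	1000
  #define NR_SEND	100

  int main(void)
  {
  	struct sockaddr_in addr = { .sin_family = AF_INET };
  	socklen_t alen = sizeof(addr);
  	char buf[1024] = {};
  	int server, i, j;

  	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

  	server = socket(AF_INET, SOCK_STREAM, 0);
  	bind(server, (struct sockaddr *)&addr, sizeof(addr));
  	getsockname(server, (struct sockaddr *)&addr, &alen);
  	listen(server, NR_PAIRS);

  	/* Create local pairs and send() without recv()ing so that
  	 * unread data stays charged while memory.stat, sockstat,
  	 * etc. are inspected.
  	 */
  	for (i = 0; i < NR_PAIRS; i++) {
  		int client = socket(AF_INET, SOCK_STREAM, 0);

  		connect(client, (struct sockaddr *)&addr, sizeof(addr));
  		accept(server, NULL, NULL);

  		for (j = 0; j < NR_SEND; j++)
  			send(client, buf, sizeof(buf), MSG_DONTWAIT);
  	}

  	pause();	/* keep the sockets alive for inspection */
  	return 0;
  }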

Setup:

  # mkdir /sys/fs/cgroup/test
  # echo $$ >> /sys/fs/cgroup/test/cgroup.procs
  # sysctl -q net.ipv4.tcp_mem="1000 1000 1000"

Without net.core.memcg_exclusive, charged to memcg & tcp_mem:

  # prlimit -n=524288:524288 bash -c "python3 pressure.py" &
  # cat /sys/fs/cgroup/test/memory.stat | grep sock
  sock 22642688 <-------------------------------------- charged to memcg
  # cat /proc/net/sockstat| grep TCP
  TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 5376 <-- charged to tcp_mem
  # ss -tn | head -n 5
  State Recv-Q Send-Q Local Address:Port  Peer Address:Port
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53188
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:49972
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53868
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53554
  # nstat | grep Pressure || echo no pressure
  TcpExtTCPMemoryPressures        1                  0.0

With net.core.memcg_exclusive=1, only charged to memcg:

  # sysctl -q net.core.memcg_exclusive=1
  # prlimit -n=524288:524288 bash -c "python3 pressure.py" &
  # cat /sys/fs/cgroup/test/memory.stat | grep sock
  sock 2757468160 <------------------------------------ charged to memcg
  # cat /proc/net/sockstat | grep TCP
  TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 0 <- NOT charged to tcp_mem
  # ss -tn | head -n 5
  State Recv-Q Send-Q  Local Address:Port  Peer Address:Port
  ESTAB 111000 0           127.0.0.1:36019    127.0.0.1:49026
  ESTAB 110000 0           127.0.0.1:36019    127.0.0.1:45630
  ESTAB 110000 0           127.0.0.1:36019    127.0.0.1:44870
  ESTAB 111000 0           127.0.0.1:36019    127.0.0.1:45274
  # nstat | grep Pressure || echo no pressure
  no pressure

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v8: Fix build failure when CONFIG_NET=n
---
 Documentation/admin-guide/sysctl/net.rst |  9 ++++++
 include/net/netns/core.h                 |  3 ++
 include/net/sock.h                       | 39 ++++++++++++++++++++++--
 mm/memcontrol.c                          | 12 +++++++-
 net/core/sock.c                          |  1 +
 net/core/sysctl_net_core.c               | 11 +++++++
 net/ipv4/af_inet.c                       |  4 +++
 7 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 2ef50828aff1..7272194dcf45 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -212,6 +212,15 @@ mem_pcpu_rsv
 
 Per-cpu reserved forward alloc cache size in page units. Default 1MB per CPU.
 
+memcg_exclusive
+---------------
+
+Skip charging socket buffers to the per-protocol global memory accounting
+(controlled by net.ipv4.tcp_mem, etc.) if they are already charged to the
+cgroup memory controller ("sock" in the memory.stat file).
+
+Default: 0
+
 rmem_default
 ------------
 
diff --git a/include/net/netns/core.h b/include/net/netns/core.h
index 9b36f0ff0c20..ec511088e67d 100644
--- a/include/net/netns/core.h
+++ b/include/net/netns/core.h
@@ -16,6 +16,9 @@ struct netns_core {
 	int	sysctl_optmem_max;
 	u8	sysctl_txrehash;
 	u8	sysctl_tstamp_allow_data;
+#ifdef CONFIG_MEMCG
+	u8	sysctl_memcg_exclusive;
+#endif
 
 #ifdef CONFIG_PROC_FS
 	struct prot_inuse __percpu *prot_inuse;
diff --git a/include/net/sock.h b/include/net/sock.h
index 66501ab670eb..0f597f4deaa3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2596,10 +2596,36 @@ static inline gfp_t gfp_memcg_charge(void)
 	return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
 }
 
+enum {
+	SK_MEMCG_EXCLUSIVE	= (1UL << 0),
+	SK_MEMCG_FLAG_MAX	= (1UL << 1),
+};
+
+#define SK_MEMCG_FLAG_MASK	(SK_MEMCG_FLAG_MAX - 1)
+#define SK_MEMCG_PTR_MASK	~SK_MEMCG_FLAG_MASK
+
 #ifdef CONFIG_MEMCG
 static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
 {
-	return sk->sk_memcg;
+	unsigned long val = (unsigned long)sk->sk_memcg;
+
+	val &= SK_MEMCG_PTR_MASK;
+	return (struct mem_cgroup *)val;
+}
+
+static inline void mem_cgroup_sk_set_flags(struct sock *sk, unsigned short flags)
+{
+	unsigned long val = (unsigned long)mem_cgroup_from_sk(sk);
+
+	val |= flags;
+	sk->sk_memcg = (struct mem_cgroup *)val;
+}
+
+static inline unsigned short mem_cgroup_sk_get_flags(const struct sock *sk)
+{
+	unsigned long val = (unsigned long)sk->sk_memcg;
+
+	return val & SK_MEMCG_FLAG_MASK;
 }
 
 static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
@@ -2609,7 +2635,7 @@ static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
 
 static inline bool mem_cgroup_sk_exclusive(const struct sock *sk)
 {
-	return false;
+	return mem_cgroup_sk_get_flags(sk) & SK_MEMCG_EXCLUSIVE;
 }
 
 static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
@@ -2634,6 +2660,15 @@ static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
 	return NULL;
 }
 
+static inline void mem_cgroup_sk_set_flags(struct sock *sk, unsigned short flags)
+{
+}
+
+static inline unsigned short mem_cgroup_sk_get_flags(const struct sock *sk)
+{
+	return 0;
+}
+
 static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
 {
 	return false;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index df3e9205c9e6..88028af8ac28 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4995,6 +4995,16 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
 DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
 EXPORT_SYMBOL(memcg_sockets_enabled_key);
 
+static void mem_cgroup_sk_set(struct sock *sk, struct mem_cgroup *memcg)
+{
+	sk->sk_memcg = memcg;
+
+#ifdef CONFIG_NET
+	if (READ_ONCE(sock_net(sk)->core.sysctl_memcg_exclusive))
+		mem_cgroup_sk_set_flags(sk, SK_MEMCG_EXCLUSIVE);
+#endif
+}
+
 void mem_cgroup_sk_alloc(struct sock *sk)
 {
 	struct mem_cgroup *memcg;
@@ -5013,7 +5023,7 @@ void mem_cgroup_sk_alloc(struct sock *sk)
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !memcg1_tcpmem_active(memcg))
 		goto out;
 	if (css_tryget(&memcg->css))
-		sk->sk_memcg = memcg;
+		mem_cgroup_sk_set(sk, memcg);
 out:
 	rcu_read_unlock();
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index 814966309b0e..348e599c3fbc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2519,6 +2519,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 #ifdef CONFIG_MEMCG
 	/* sk->sk_memcg will be populated at accept() time */
 	newsk->sk_memcg = NULL;
+	mem_cgroup_sk_set_flags(newsk, mem_cgroup_sk_get_flags(sk));
 #endif
 
 	cgroup_sk_clone(&newsk->sk_cgrp_data);
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 8cf04b57ade1..c8b5fc3b8435 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -676,6 +676,17 @@ static struct ctl_table netns_core_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE
 	},
+#ifdef CONFIG_MEMCG
+	{
+		.procname	= "memcg_exclusive",
+		.data		= &init_net.core.sysctl_memcg_exclusive,
+		.maxlen		= sizeof(u8),
+		.mode		= 0644,
+		.proc_handler	= proc_dou8vec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE
+	},
+#endif
 	/* sysctl_core_net_init() will set the values after this
 	 * to readonly in network namespaces
 	 */
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index c410eb525ebe..be27de94143c 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -758,12 +758,16 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
 	if (mem_cgroup_sockets_enabled &&
 	    (!IS_ENABLED(CONFIG_IP_SCTP) || sk_is_tcp(newsk))) {
 		gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
+		unsigned short flags;
 
+		flags = mem_cgroup_sk_get_flags(newsk);
 		mem_cgroup_sk_alloc(newsk);
 
 		if (mem_cgroup_from_sk(newsk)) {
 			int amt;
 
+			mem_cgroup_sk_set_flags(newsk, flags);
+
 			/* The socket has not been accepted yet, no need
 			 * to look at newsk->sk_wmem_queued.
 			 */
-- 
2.51.0.384.g4c02a37b29-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v9 bpf-next/net 4/6] bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE.
  2025-09-17 19:13 [PATCH v9 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
                   ` (2 preceding siblings ...)
  2025-09-17 19:13 ` [PATCH v9 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl Kuniyuki Iwashima
@ 2025-09-17 19:14 ` Kuniyuki Iwashima
  2025-09-17 19:14 ` [PATCH v9 bpf-next/net 5/6] bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_EXCLUSIVE Kuniyuki Iwashima
  2025-09-17 19:14 ` [PATCH v9 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE Kuniyuki Iwashima
  5 siblings, 0 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-17 19:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

We will support flagging SK_MEMCG_EXCLUSIVE via bpf_setsockopt() at
the BPF_CGROUP_INET_SOCK_CREATE hook.

BPF_CGROUP_INET_SOCK_CREATE is invoked by __cgroup_bpf_run_filter_sk(),
which passes a pointer to struct sock to the bpf prog as void *ctx.

But there is no bpf_func_proto for bpf_setsockopt() that receives
the ctx as a pointer to struct sock.

Also, bpf_getsockopt() will be necessary for a cgroup with multiple
bpf progs running.

Let's add new bpf_setsockopt() and bpf_getsockopt() variants for
BPF_CGROUP_INET_SOCK_CREATE.

Note that inet_create() is not under lock_sock() and has the same
semantics as bpf_lsm_unlocked_sockopt_hooks.
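
As an illustration only (this program is not part of the series), a
prog at the sock_create hook can then call the generic sockopt path;
SO_KEEPALIVE is just an arbitrary option already handled by
__bpf_setsockopt():

  // SPDX-License-Identifier: GPL-2.0
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  #define SOL_SOCKET	1
  #define SO_KEEPALIVE	9

  SEC("cgroup/sock_create")
  int enable_keepalive(struct bpf_sock *ctx)
  {
  	int one = 1;

  	/* ctx is the struct sock being created in inet_create(). */
  	if (bpf_setsockopt(ctx, SOL_SOCKET, SO_KEEPALIVE, &one, sizeof(one)))
  		return 0;	/* reject the socket on failure */

  	return 1;
  }

  char LICENSE[] SEC("license") = "GPL";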

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v5: Rename new variants to bpf_sock_create_{get,set}sockopt().
v4:
  * Use __bpf_setsockopt() instead of _bpf_setsockopt()
  * Add getsockopt() for a cgroup with multiple bpf progs running
v3: Remove bpf_func_proto for accept()
v2: Make 2 new bpf_func_proto static
---
 net/core/filter.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 63f3baee2daf..31b259f02ee9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5723,6 +5723,40 @@ static const struct bpf_func_proto bpf_sock_addr_getsockopt_proto = {
 	.arg5_type	= ARG_CONST_SIZE,
 };
 
+BPF_CALL_5(bpf_sock_create_setsockopt, struct sock *, sk, int, level,
+	   int, optname, char *, optval, int, optlen)
+{
+	return __bpf_setsockopt(sk, level, optname, optval, optlen);
+}
+
+static const struct bpf_func_proto bpf_sock_create_setsockopt_proto = {
+	.func		= bpf_sock_create_setsockopt,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+	.arg4_type	= ARG_PTR_TO_MEM | MEM_RDONLY,
+	.arg5_type	= ARG_CONST_SIZE,
+};
+
+BPF_CALL_5(bpf_sock_create_getsockopt, struct sock *, sk, int, level,
+	   int, optname, char *, optval, int, optlen)
+{
+	return __bpf_getsockopt(sk, level, optname, optval, optlen);
+}
+
+static const struct bpf_func_proto bpf_sock_create_getsockopt_proto = {
+	.func		= bpf_sock_create_getsockopt,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+	.arg4_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg5_type	= ARG_CONST_SIZE,
+};
+
 BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 	   int, level, int, optname, char *, optval, int, optlen)
 {
@@ -8051,6 +8085,20 @@ sock_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_sk_storage_get_cg_sock_proto;
 	case BPF_FUNC_ktime_get_coarse_ns:
 		return &bpf_ktime_get_coarse_ns_proto;
+	case BPF_FUNC_setsockopt:
+		switch (prog->expected_attach_type) {
+		case BPF_CGROUP_INET_SOCK_CREATE:
+			return &bpf_sock_create_setsockopt_proto;
+		default:
+			return NULL;
+		}
+	case BPF_FUNC_getsockopt:
+		switch (prog->expected_attach_type) {
+		case BPF_CGROUP_INET_SOCK_CREATE:
+			return &bpf_sock_create_getsockopt_proto;
+		default:
+			return NULL;
+		}
 	default:
 		return bpf_base_func_proto(func_id, prog);
 	}
-- 
2.51.0.384.g4c02a37b29-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v9 bpf-next/net 5/6] bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_EXCLUSIVE.
  2025-09-17 19:13 [PATCH v9 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
                   ` (3 preceding siblings ...)
  2025-09-17 19:14 ` [PATCH v9 bpf-next/net 4/6] bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE Kuniyuki Iwashima
@ 2025-09-17 19:14 ` Kuniyuki Iwashima
  2025-09-17 19:14 ` [PATCH v9 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE Kuniyuki Iwashima
  5 siblings, 0 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-17 19:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

If a socket has sk->sk_memcg with SK_MEMCG_EXCLUSIVE, it is decoupled
from the global protocol memory accounting.

This is controlled by the net.core.memcg_exclusive sysctl, but it
lacks flexibility.

Let's support flagging (and clearing) SK_MEMCG_EXCLUSIVE via
bpf_setsockopt() at the BPF_CGROUP_INET_SOCK_CREATE hook.

  u32 flags = SK_BPF_MEMCG_EXCLUSIVE;

  bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
                 &flags, sizeof(flags));

As with net.core.memcg_exclusive, this is inherited by child sockets,
and BPF always takes precedence over sysctl at socket(2) and accept(2).
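
A prog can also clear the flag to opt sockets out even when the
sysctl is set to 1 (illustration only):

  u32 flags = 0;

  bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
                 &flags, sizeof(flags));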

SK_BPF_MEMCG_FLAGS is only supported at BPF_CGROUP_INET_SOCK_CREATE
and not on other hooks, for the following reasons:

  1. UDP charges memory under sk->sk_receive_queue.lock instead
     of lock_sock()

  2. For TCP child sockets, memory accounting is adjusted only in
     __inet_accept(), to which sk->sk_memcg allocation is deferred

  3. Modifying the flag after skb is charged to sk requires such
     adjustment during bpf_setsockopt() and complicates the logic
     unnecessarily

We can support other hooks later if a real use case justifies that.

Most of the changed functions are inlined and hard to trace, but a
microbenchmark on __sk_mem_raise_allocated() during neper/tcp_stream
showed that more samples completed faster with SK_MEMCG_EXCLUSIVE.
This will be more visible under tcp_mem pressure.

  # bpftrace -e 'kprobe:__sk_mem_raise_allocated { @start[tid] = nsecs; }
    kretprobe:__sk_mem_raise_allocated /@start[tid]/
    { @end[tid] = nsecs - @start[tid]; @times = hist(@end[tid]); delete(@start[tid]); }'
  # tcp_stream -6 -F 1000 -N -T 256

Without bpf prog:

  [128, 256)          3846 |                                                    |
  [256, 512)       1505326 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [512, 1K)        1371006 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
  [1K, 2K)          198207 |@@@@@@                                              |
  [2K, 4K)           31199 |@                                                   |

With bpf prog in the next patch:
  (must be attached before tcp_stream)
  # bpftool prog load sk_memcg.bpf.o /sys/fs/bpf/sk_memcg type cgroup/sock_create
  # bpftool cgroup attach /sys/fs/cgroup/test cgroup_inet_sock_create pinned /sys/fs/bpf/sk_memcg

  [128, 256)          6413 |                                                    |
  [256, 512)       1868425 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [512, 1K)        1101697 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
  [1K, 2K)          117031 |@@@@                                                |
  [2K, 4K)           11773 |                                                    |

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v7:
  * Update commit message.

v5:
  * Limit getsockopt() to BPF_CGROUP_INET_SOCK_CREATE

v4:
  * Only allow inet_create() to set flags
  * Inherit flags from listener to child in sk_clone_lock()
  * Support clearing flags

v3:
  * Allow setting flags without sk->sk_memcg in sk_bpf_set_get_memcg_flags()
  * Preserve flags in __inet_accept()

v2:
  * s/mem_cgroup_sk_set_flag/mem_cgroup_sk_set_flags/ when CONFIG_MEMCG=n
  * Use CONFIG_CGROUP_BPF instead of CONFIG_BPF_SYSCALL for ifdef
---
 include/uapi/linux/bpf.h       |  6 ++++++
 mm/memcontrol.c                |  3 +++
 net/core/filter.c              | 34 ++++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  6 ++++++
 4 files changed, 49 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 233de8677382..35e3ce40ac90 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7182,6 +7182,7 @@ enum {
 	TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
 	TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
 	SK_BPF_CB_FLAGS		= 1009, /* Get or set sock ops flags in socket */
+	SK_BPF_MEMCG_FLAGS	= 1010, /* Get or Set flags saved in sk->sk_memcg */
 };
 
 enum {
@@ -7204,6 +7205,11 @@ enum {
 						 */
 };
 
+enum {
+	SK_BPF_MEMCG_EXCLUSIVE	= (1UL << 0),
+	SK_BPF_MEMCG_FLAG_MAX	= (1UL << 1),
+};
+
 struct bpf_perf_event_value {
 	__u64 counter;
 	__u64 enabled;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 88028af8ac28..b7d405b57e23 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4997,6 +4997,9 @@ EXPORT_SYMBOL(memcg_sockets_enabled_key);
 
 static void mem_cgroup_sk_set(struct sock *sk, struct mem_cgroup *memcg)
 {
+	BUILD_BUG_ON((unsigned short)SK_MEMCG_EXCLUSIVE != SK_BPF_MEMCG_EXCLUSIVE);
+	BUILD_BUG_ON((unsigned short)SK_MEMCG_FLAG_MAX != SK_BPF_MEMCG_FLAG_MAX);
+
 	sk->sk_memcg = memcg;
 
 #ifdef CONFIG_NET
diff --git a/net/core/filter.c b/net/core/filter.c
index 31b259f02ee9..df2496120076 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5723,9 +5723,39 @@ static const struct bpf_func_proto bpf_sock_addr_getsockopt_proto = {
 	.arg5_type	= ARG_CONST_SIZE,
 };
 
+static int sk_bpf_set_get_memcg_flags(struct sock *sk,
+				      char *optval, int optlen,
+				      bool getopt)
+{
+	u32 flags;
+
+	if (optlen != sizeof(u32))
+		return -EINVAL;
+
+	if (!sk_has_account(sk))
+		return -EOPNOTSUPP;
+
+	if (getopt) {
+		*(u32 *)optval = mem_cgroup_sk_get_flags(sk);
+		return 0;
+	}
+
+	flags = *(u32 *)optval;
+	if (flags >= SK_BPF_MEMCG_FLAG_MAX)
+		return -EINVAL;
+
+	mem_cgroup_sk_set_flags(sk, flags);
+
+	return 0;
+}
+
 BPF_CALL_5(bpf_sock_create_setsockopt, struct sock *, sk, int, level,
 	   int, optname, char *, optval, int, optlen)
 {
+	if (IS_ENABLED(CONFIG_MEMCG) &&
+	    level == SOL_SOCKET && optname == SK_BPF_MEMCG_FLAGS)
+		return sk_bpf_set_get_memcg_flags(sk, optval, optlen, false);
+
 	return __bpf_setsockopt(sk, level, optname, optval, optlen);
 }
 
@@ -5743,6 +5773,10 @@ static const struct bpf_func_proto bpf_sock_create_setsockopt_proto = {
 BPF_CALL_5(bpf_sock_create_getsockopt, struct sock *, sk, int, level,
 	   int, optname, char *, optval, int, optlen)
 {
+	if (IS_ENABLED(CONFIG_MEMCG) &&
+	    level == SOL_SOCKET && optname == SK_BPF_MEMCG_FLAGS)
+		return sk_bpf_set_get_memcg_flags(sk, optval, optlen, true);
+
 	return __bpf_getsockopt(sk, level, optname, optval, optlen);
 }
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 233de8677382..35e3ce40ac90 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7182,6 +7182,7 @@ enum {
 	TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
 	TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
 	SK_BPF_CB_FLAGS		= 1009, /* Get or set sock ops flags in socket */
+	SK_BPF_MEMCG_FLAGS	= 1010, /* Get or Set flags saved in sk->sk_memcg */
 };
 
 enum {
@@ -7204,6 +7205,11 @@ enum {
 						 */
 };
 
+enum {
+	SK_BPF_MEMCG_EXCLUSIVE	= (1UL << 0),
+	SK_BPF_MEMCG_FLAG_MAX	= (1UL << 1),
+};
+
 struct bpf_perf_event_value {
 	__u64 counter;
 	__u64 enabled;
-- 
2.51.0.384.g4c02a37b29-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v9 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE.
  2025-09-17 19:13 [PATCH v9 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
                   ` (4 preceding siblings ...)
  2025-09-17 19:14 ` [PATCH v9 bpf-next/net 5/6] bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_EXCLUSIVE Kuniyuki Iwashima
@ 2025-09-17 19:14 ` Kuniyuki Iwashima
  2025-09-17 23:38   ` Martin KaFai Lau
  5 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-17 19:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau
  Cc: John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

The test does the following for IPv4/IPv6 x TCP/UDP sockets
with/without SK_MEMCG_EXCLUSIVE, which can be turned on by
net.core.memcg_exclusive or bpf_setsockopt(SK_BPF_MEMCG_EXCLUSIVE).

  1. Create socket pairs
  2. Send a bunch of data that requires more than 1024 pages
  3. Read memory_allocated from sk->sk_prot->memory_allocated and
     sk->sk_prot->memory_per_cpu_fw_alloc
  4. Check if unread data is charged to memory_allocated

If SK_MEMCG_EXCLUSIVE is set, memory_allocated should not change,
but we allow a small error (up to 10 pages) in case other processes
on the host use some amount of TCP/UDP memory.

The number of allocated pages is buffered in the per-cpu variable
{tcp,udp}_memory_per_cpu_fw_alloc, up to +/- net.core.mem_pcpu_rsv,
before being reported to {tcp,udp}_memory_allocated.

At 3., memory_allocated is calculated from the two variables twice,
at the fentry and fexit of the socket create function, to check whether
the per-cpu value is drained during the calculation.  If so, 3. is
retried.

We use kern_sync_rcu() for UDP because the UDP recv queue is destroyed
after an RCU grace period.

The test takes ~2s on QEMU (64 CPUs) w/ KVM but takes 6s w/o KVM.

  # time ./test_progs -t sk_memcg
  #370/1   sk_memcg/TCP  :OK
  #370/2   sk_memcg/UDP  :OK
  #370/3   sk_memcg/TCPv6:OK
  #370/4   sk_memcg/UDPv6:OK
  #370     sk_memcg:OK
  Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED

  real	0m1.623s
  user	0m0.165s
  sys	0m0.366s

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v7:
  * Add test for sysctl

v6:
  * Trace sk_prot->memory_allocated + sk_prot->memory_per_cpu_fw_alloc

v5:
  * Use kern_sync_rcu()
  * Double NR_SEND to 128

v4:
  * Only use inet_create() hook
  * Test bpf_getsockopt()
  * Add serial_ prefix
  * Reduce sleep() and the amount of sent data
---
 .../selftests/bpf/prog_tests/sk_memcg.c       | 261 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/sk_memcg.c  | 146 ++++++++++
 2 files changed, 407 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_memcg.c
 create mode 100644 tools/testing/selftests/bpf/progs/sk_memcg.c

diff --git a/tools/testing/selftests/bpf/prog_tests/sk_memcg.c b/tools/testing/selftests/bpf/prog_tests/sk_memcg.c
new file mode 100644
index 000000000000..777fb81e9365
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/sk_memcg.c
@@ -0,0 +1,261 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2025 Google LLC */
+
+#include <test_progs.h>
+#include "sk_memcg.skel.h"
+#include "network_helpers.h"
+
+#define NR_SOCKETS	64
+#define NR_SEND		128
+#define BUF_SINGLE	1024
+#define BUF_TOTAL	(BUF_SINGLE * NR_SEND)
+
+struct test_case {
+	char name[8];
+	int family;
+	int type;
+	int (*create_sockets)(struct test_case *test_case, int sk[], int len);
+	long (*get_memory_allocated)(struct test_case *test_case, struct sk_memcg *skel);
+};
+
+static int tcp_create_sockets(struct test_case *test_case, int sk[], int len)
+{
+	int server, i;
+
+	server = start_server(test_case->family, test_case->type, NULL, 0, 0);
+	ASSERT_GE(server, 0, "start_server");
+
+	for (i = 0; i < len / 2; i++) {
+		sk[i * 2] = connect_to_fd(server, 0);
+		if (!ASSERT_GE(sk[i * 2], 0, "connect_to_fd"))
+			return sk[i * 2];
+
+		sk[i * 2 + 1] = accept(server, NULL, NULL);
+		if (!ASSERT_GE(sk[i * 2 + 1], 0, "accept"))
+			return sk[i * 2 + 1];
+	}
+
+	close(server);
+
+	return 0;
+}
+
+static int udp_create_sockets(struct test_case *test_case, int sk[], int len)
+{
+	int i, err, rcvbuf = BUF_TOTAL;
+
+	for (i = 0; i < len / 2; i++) {
+		sk[i * 2] = start_server(test_case->family, test_case->type, NULL, 0, 0);
+		if (!ASSERT_GE(sk[i * 2], 0, "start_server"))
+			return sk[i * 2];
+
+		sk[i * 2 + 1] = connect_to_fd(sk[i * 2], 0);
+		if (!ASSERT_GE(sk[i * 2 + 1], 0, "connect_to_fd"))
+			return sk[i * 2 + 1];
+
+		err = connect_fd_to_fd(sk[i * 2], sk[i * 2 + 1], 0);
+		if (!ASSERT_EQ(err, 0, "connect_fd_to_fd"))
+			return err;
+
+		err = setsockopt(sk[i * 2], SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(int));
+		if (!ASSERT_EQ(err, 0, "setsockopt(SO_RCVBUF)"))
+			return err;
+
+		err = setsockopt(sk[i * 2 + 1], SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(int));
+		if (!ASSERT_EQ(err, 0, "setsockopt(SO_RCVBUF)"))
+			return err;
+	}
+
+	return 0;
+}
+
+static long get_memory_allocated(struct test_case *test_case,
+				 bool *activated, bool *stable,
+				 long *memory_allocated)
+{
+	*stable = false;
+
+	do {
+		*activated = true;
+
+		/* AF_INET and AF_INET6 share the same memory_allocated.
+		 * tcp_init_sock() is called by AF_INET and AF_INET6,
+		 * but udp_lib_init_sock() is inline.
+		 */
+		socket(AF_INET, test_case->type, 0);
+	} while (!*stable);
+
+	return *memory_allocated;
+}
+
+static long tcp_get_memory_allocated(struct test_case *test_case, struct sk_memcg *skel)
+{
+	return get_memory_allocated(test_case,
+				    &skel->bss->tcp_activated,
+				    &skel->bss->tcp_stable,
+				    &skel->bss->tcp_memory_allocated);
+}
+
+static long udp_get_memory_allocated(struct test_case *test_case, struct sk_memcg *skel)
+{
+	return get_memory_allocated(test_case,
+				    &skel->bss->udp_activated,
+				    &skel->bss->udp_stable,
+				    &skel->bss->udp_memory_allocated);
+}
+
+static int check_exclusive(struct test_case *test_case,
+			   struct sk_memcg *skel, bool exclusive)
+{
+	char buf[BUF_SINGLE] = {};
+	long memory_allocated[2];
+	int sk[NR_SOCKETS] = {};
+	int err, i, j;
+
+	err = test_case->create_sockets(test_case, sk, ARRAY_SIZE(sk));
+	if (err)
+		goto close;
+
+	memory_allocated[0] = test_case->get_memory_allocated(test_case, skel);
+
+	/* allocate pages >= 1024 */
+	for (i = 0; i < ARRAY_SIZE(sk); i++) {
+		for (j = 0; j < NR_SEND; j++) {
+			int bytes = send(sk[i], buf, sizeof(buf), 0);
+
+			/* Avoid too noisy logs when something failed. */
+			if (bytes != sizeof(buf)) {
+				ASSERT_EQ(bytes, sizeof(buf), "send");
+				if (bytes < 0) {
+					err = bytes;
+					goto close;
+				}
+			}
+		}
+	}
+
+	memory_allocated[1] = test_case->get_memory_allocated(test_case, skel);
+
+	if (exclusive)
+		ASSERT_LE(memory_allocated[1], memory_allocated[0] + 10, "exclusive");
+	else
+		ASSERT_GT(memory_allocated[1], memory_allocated[0] + 1024, "not exclusive");
+
+close:
+	for (i = 0; i < ARRAY_SIZE(sk); i++)
+		close(sk[i]);
+
+	if (test_case->type == SOCK_DGRAM) {
+		/* UDP recv queue is destroyed after RCU grace period.
+		 * With one kern_sync_rcu(), memory_allocated[0] of the
+		 * isoalted case often matches with memory_allocated[1]
+		 * exclusive case often matches with memory_allocated[1]
+		 */
+		kern_sync_rcu();
+		kern_sync_rcu();
+	}
+
+	return err;
+}
+
+void run_test(struct test_case *test_case)
+{
+	struct nstoken *nstoken;
+	struct sk_memcg *skel;
+	int cgroup, err;
+
+	skel = sk_memcg__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	skel->bss->nr_cpus = libbpf_num_possible_cpus();
+
+	err = sk_memcg__attach(skel);
+	if (!ASSERT_OK(err, "attach"))
+		goto destroy_skel;
+
+	cgroup = test__join_cgroup("/sk_memcg");
+	if (!ASSERT_GE(cgroup, 0, "join_cgroup"))
+		goto destroy_skel;
+
+	err = make_netns("sk_memcg");
+	if (!ASSERT_EQ(err, 0, "make_netns"))
+		goto close_cgroup;
+
+	nstoken = open_netns("sk_memcg");
+	if (!ASSERT_OK_PTR(nstoken, "open_netns"))
+		goto remove_netns;
+
+	err = check_exclusive(test_case, skel, false);
+	if (!ASSERT_EQ(err, 0, "test_exclusive(false)"))
+		goto close_netns;
+
+	err = write_sysctl("/proc/sys/net/core/memcg_exclusive", "1");
+	if (!ASSERT_EQ(err, 0, "write_sysctl(1)"))
+		goto close_netns;
+
+	err = check_exclusive(test_case, skel, true);
+	if (!ASSERT_EQ(err, 0, "test_exclusive(true by sysctl)"))
+		goto close_netns;
+
+	err = write_sysctl("/proc/sys/net/core/memcg_exclusive", "0");
+	if (!ASSERT_EQ(err, 0, "write_sysctl(0)"))
+		goto close_netns;
+
+	skel->links.sock_create = bpf_program__attach_cgroup(skel->progs.sock_create, cgroup);
+	if (!ASSERT_OK_PTR(skel->links.sock_create, "attach_cgroup(sock_create)"))
+		goto close_netns;
+
+	err = check_exclusive(test_case, skel, true);
+	ASSERT_EQ(err, 0, "test_exclusive(true by bpf)");
+
+close_netns:
+	close_netns(nstoken);
+remove_netns:
+	remove_netns("sk_memcg");
+close_cgroup:
+	close(cgroup);
+destroy_skel:
+	sk_memcg__destroy(skel);
+}
+
+struct test_case test_cases[] = {
+	{
+		.name = "TCP  ",
+		.family = AF_INET,
+		.type = SOCK_STREAM,
+		.create_sockets = tcp_create_sockets,
+		.get_memory_allocated = tcp_get_memory_allocated,
+	},
+	{
+		.name = "UDP  ",
+		.family = AF_INET,
+		.type = SOCK_DGRAM,
+		.create_sockets = udp_create_sockets,
+		.get_memory_allocated = udp_get_memory_allocated,
+	},
+	{
+		.name = "TCPv6",
+		.family = AF_INET6,
+		.type = SOCK_STREAM,
+		.create_sockets = tcp_create_sockets,
+		.get_memory_allocated = tcp_get_memory_allocated,
+	},
+	{
+		.name = "UDPv6",
+		.family = AF_INET6,
+		.type = SOCK_DGRAM,
+		.create_sockets = udp_create_sockets,
+		.get_memory_allocated = udp_get_memory_allocated,
+	},
+};
+
+void serial_test_sk_memcg(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
+		test__start_subtest(test_cases[i].name);
+		run_test(&test_cases[i]);
+	}
+}
diff --git a/tools/testing/selftests/bpf/progs/sk_memcg.c b/tools/testing/selftests/bpf/progs/sk_memcg.c
new file mode 100644
index 000000000000..6b1a928a0c90
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/sk_memcg.c
@@ -0,0 +1,146 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2025 Google LLC */
+
+#include "bpf_tracing_net.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <errno.h>
+
+extern int tcp_memory_per_cpu_fw_alloc __ksym;
+extern int udp_memory_per_cpu_fw_alloc __ksym;
+
+int nr_cpus;
+bool tcp_activated, tcp_stable, udp_activated, udp_stable;
+long tcp_memory_allocated, udp_memory_allocated;
+static struct sock *tcp_sk_tracing, *udp_sk_tracing;
+
+struct sk_prot {
+	long *memory_allocated;
+	int *memory_per_cpu_fw_alloc;
+};
+
+static int drain_memory_per_cpu_fw_alloc(__u32 i, struct sk_prot *sk_prot_ctx)
+{
+	int *memory_per_cpu_fw_alloc;
+
+	memory_per_cpu_fw_alloc = bpf_per_cpu_ptr(sk_prot_ctx->memory_per_cpu_fw_alloc, i);
+	if (memory_per_cpu_fw_alloc)
+		*sk_prot_ctx->memory_allocated += *memory_per_cpu_fw_alloc;
+
+	return 0;
+}
+
+static long get_memory_allocated(struct sock *_sk, int *memory_per_cpu_fw_alloc)
+{
+	struct sock *sk = bpf_core_cast(_sk, struct sock);
+	struct sk_prot sk_prot_ctx;
+	long memory_allocated;
+
+	/* net_aligned_data.{tcp,udp}_memory_allocated was not available. */
+	memory_allocated = sk->__sk_common.skc_prot->memory_allocated->counter;
+
+	sk_prot_ctx.memory_allocated = &memory_allocated;
+	sk_prot_ctx.memory_per_cpu_fw_alloc = memory_per_cpu_fw_alloc;
+
+	bpf_loop(nr_cpus, drain_memory_per_cpu_fw_alloc, &sk_prot_ctx, 0);
+
+	return memory_allocated;
+}
+
+static void fentry_init_sock(struct sock *sk, struct sock **sk_tracing,
+			     long *memory_allocated, int *memory_per_cpu_fw_alloc,
+			     bool *activated)
+{
+	if (!*activated)
+		return;
+
+	if (__sync_val_compare_and_swap(sk_tracing, NULL, sk))
+		return;
+
+	*activated = false;
+	*memory_allocated = get_memory_allocated(sk, memory_per_cpu_fw_alloc);
+}
+
+static void fexit_init_sock(struct sock *sk, struct sock **sk_tracing,
+			    long *memory_allocated, int *memory_per_cpu_fw_alloc,
+			    bool *stable)
+{
+	long new_memory_allocated;
+
+	if (sk != *sk_tracing)
+		return;
+
+	new_memory_allocated = get_memory_allocated(sk, memory_per_cpu_fw_alloc);
+	if (new_memory_allocated == *memory_allocated)
+		*stable = true;
+
+	*sk_tracing = NULL;
+}
+
+SEC("fentry/tcp_init_sock")
+int BPF_PROG(fentry_tcp_init_sock, struct sock *sk)
+{
+	fentry_init_sock(sk, &tcp_sk_tracing,
+			 &tcp_memory_allocated, &tcp_memory_per_cpu_fw_alloc,
+			 &tcp_activated);
+	return 0;
+}
+
+SEC("fexit/tcp_init_sock")
+int BPF_PROG(fexit_tcp_init_sock, struct sock *sk)
+{
+	fexit_init_sock(sk, &tcp_sk_tracing,
+			&tcp_memory_allocated, &tcp_memory_per_cpu_fw_alloc,
+			&tcp_stable);
+	return 0;
+}
+
+SEC("fentry/udp_init_sock")
+int BPF_PROG(fentry_udp_init_sock, struct sock *sk)
+{
+	fentry_init_sock(sk, &udp_sk_tracing,
+			 &udp_memory_allocated, &udp_memory_per_cpu_fw_alloc,
+			 &udp_activated);
+	return 0;
+}
+
+SEC("fexit/udp_init_sock")
+int BPF_PROG(fexit_udp_init_sock, struct sock *sk)
+{
+	fexit_init_sock(sk, &udp_sk_tracing,
+			&udp_memory_allocated, &udp_memory_per_cpu_fw_alloc,
+			&udp_stable);
+	return 0;
+}
+
+SEC("cgroup/sock_create")
+int sock_create(struct bpf_sock *ctx)
+{
+	u32 flags = SK_BPF_MEMCG_EXCLUSIVE;
+	int err;
+
+	err = bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
+			     &flags, sizeof(flags));
+	if (err)
+		goto err;
+
+	flags = 0;
+
+	err = bpf_getsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
+			     &flags, sizeof(flags));
+	if (err)
+		goto err;
+
+	if (flags != SK_BPF_MEMCG_EXCLUSIVE) {
+		err = -EINVAL;
+		goto err;
+	}
+
+	return 1;
+
+err:
+	bpf_set_retval(err);
+	return 0;
+}
+
+char LICENSE[] SEC("license") = "GPL";
-- 
2.51.0.384.g4c02a37b29-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v9 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE.
  2025-09-17 19:14 ` [PATCH v9 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE Kuniyuki Iwashima
@ 2025-09-17 23:38   ` Martin KaFai Lau
  2025-09-18  1:17     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 12+ messages in thread
From: Martin KaFai Lau @ 2025-09-17 23:38 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, bpf, netdev

On 9/17/25 12:14 PM, Kuniyuki Iwashima wrote:
> The test does the following for IPv4/IPv6 x TCP/UDP sockets
> with/without SK_MEMCG_EXCLUSIVE, which can be turned on by
> net.core.memcg_exclusive or bpf_setsockopt(SK_BPF_MEMCG_EXCLUSIVE).
> 
>    1. Create socket pairs
>    2. Send a bunch of data that requires more than 1024 pages
>    3. Read memory_allocated from sk->sk_prot->memory_allocated and
>       sk->sk_prot->memory_per_cpu_fw_alloc
>    4. Check if unread data is charged to memory_allocated
> 
> If SK_MEMCG_EXCLUSIVE is set, memory_allocated should not be
> changed, but we allow a small error (up to 10 pages) in case
> other processes on the host use some amount of TCP/UDP memory.
> 
> The number of allocated pages is buffered in the per-cpu variable
> {tcp,udp}_memory_per_cpu_fw_alloc up to +/- net.core.mem_pcpu_rsv
> before being reported to {tcp,udp}_memory_allocated.
> 
> At 3., memory_allocated is calculated from the two variables twice,
> at fentry and fexit of the socket create function, to check whether the
> per-cpu value was drained during the calculation.  In that case, 3. is retried.
> 
> We use kern_sync_rcu() for UDP because UDP recv queue is destroyed
> after RCU grace period.
> 
> The test takes ~2s on QEMU (64 CPUs) w/ KVM but takes 6s w/o KVM.
> 
>    # time ./test_progs -t sk_memcg
>    #370/1   sk_memcg/TCP  :OK
>    #370/2   sk_memcg/UDP  :OK
>    #370/3   sk_memcg/TCPv6:OK
>    #370/4   sk_memcg/UDPv6:OK
>    #370     sk_memcg:OK
>    Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED
> 
>    real	0m1.623s
>    user	0m0.165s
>    sys	0m0.366s
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> ---
> v7:
>    * Add test for sysctl
> 
> v6:
>    * Trace sk_prot->memory_allocated + sk_prot->memory_per_cpu_fw_alloc
> 
> v5:
>    * Use kern_sync_rcu()
>    * Double NR_SEND to 128
> 
> v4:
>    * Only use inet_create() hook
>    * Test bpf_getsockopt()
>    * Add serial_ prefix
>    * Reduce sleep() and the amount of sent data
> ---
>   .../selftests/bpf/prog_tests/sk_memcg.c       | 261 ++++++++++++++++++
>   tools/testing/selftests/bpf/progs/sk_memcg.c  | 146 ++++++++++
>   2 files changed, 407 insertions(+)
>   create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_memcg.c
>   create mode 100644 tools/testing/selftests/bpf/progs/sk_memcg.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/sk_memcg.c b/tools/testing/selftests/bpf/prog_tests/sk_memcg.c
> new file mode 100644
> index 000000000000..777fb81e9365
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/sk_memcg.c
> @@ -0,0 +1,261 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright 2025 Google LLC */
> +
> +#include <test_progs.h>
> +#include "sk_memcg.skel.h"
> +#include "network_helpers.h"
> +
> +#define NR_SOCKETS	64
> +#define NR_SEND		128
> +#define BUF_SINGLE	1024
> +#define BUF_TOTAL	(BUF_SINGLE * NR_SEND)
> +
> +struct test_case {
> +	char name[8];
> +	int family;
> +	int type;
> +	int (*create_sockets)(struct test_case *test_case, int sk[], int len);
> +	long (*get_memory_allocated)(struct test_case *test_case, struct sk_memcg *skel);
> +};
> +
> +static int tcp_create_sockets(struct test_case *test_case, int sk[], int len)
> +{
> +	int server, i;
> +
> +	server = start_server(test_case->family, test_case->type, NULL, 0, 0);
> +	ASSERT_GE(server, 0, "start_server_str");
> +
> +	for (i = 0; i < len / 2; i++) {
> +		sk[i * 2] = connect_to_fd(server, 0);
> +		if (!ASSERT_GE(sk[i * 2], 0, "connect_to_fd"))
> +			return sk[i * 2];
> +
> +		sk[i * 2 + 1] = accept(server, NULL, NULL);
> +		if (!ASSERT_GE(sk[i * 2 + 1], 0, "accept"))
> +			return sk[i * 2 + 1];
> +	}
> +
> +	close(server);
> +
> +	return 0;
> +}
> +
> +static int udp_create_sockets(struct test_case *test_case, int sk[], int len)
> +{
> +	int i, err, rcvbuf = BUF_TOTAL;
> +
> +	for (i = 0; i < len / 2; i++) {

nit. How about "for (i = 0; i < len; i += 2) {" once here instead of "i * 2" 
below. Same for the tcp_create_sockets() above.
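
Something like this (untested) is what I have in mind:

	for (i = 0; i < len; i += 2) {
		sk[i] = start_server(test_case->family, test_case->type, NULL, 0, 0);
		if (!ASSERT_GE(sk[i], 0, "start_server"))
			return sk[i];

		sk[i + 1] = connect_to_fd(sk[i], 0);
		if (!ASSERT_GE(sk[i + 1], 0, "connect_to_fd"))
			return sk[i + 1];

		/* ... rest of the body unchanged, just using sk[i] / sk[i + 1] ... */
	}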

> +		sk[i * 2] = start_server(test_case->family, test_case->type, NULL, 0, 0);
> +		if (!ASSERT_GE(sk[i * 2], 0, "start_server"))
> +			return sk[i * 2];
> +
> +		sk[i * 2 + 1] = connect_to_fd(sk[i * 2], 0);
> +		if (!ASSERT_GE(sk[i * 2 + 1], 0, "connect_to_fd"))
> +			return sk[i * 2 + 1];
> +
> +		err = connect_fd_to_fd(sk[i * 2], sk[i * 2 + 1], 0);
> +		if (!ASSERT_EQ(err, 0, "connect_fd_to_fd"))
> +			return err;
> +
> +		err = setsockopt(sk[i * 2], SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(int));
> +		if (!ASSERT_EQ(err, 0, "setsockopt(SO_RCVBUF)"))
> +			return err;
> +
> +		err = setsockopt(sk[i * 2 + 1], SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(int));
> +		if (!ASSERT_EQ(err, 0, "setsockopt(SO_RCVBUF)"))
> +			return err;
> +	}
> +
> +	return 0;
> +}
> +
> +static long get_memory_allocated(struct test_case *test_case,
> +				 bool *activated, bool *stable,
> +				 long *memory_allocated)
> +{
> +	*stable = false;
> +
> +	do {
> +		*activated = true;
> +
> +		/* AF_INET and AF_INET6 share the same memory_allocated.
> +		 * tcp_init_sock() is called by AF_INET and AF_INET6,
> +		 * but udp_lib_init_sock() is inline.
> +		 */
> +		socket(AF_INET, test_case->type, 0);

fd is leaked.

> +	} while (!*stable);

cannot loop forever. The test needs to assume the machine is relatively network 
quiet anyway (so serial_). Things can still change after the stable test also. I 
think having a way (the fentry in the progs/sk_memcg.c) to account for the 
percpu fw alloc is good enough, and this should help if there is some light 
background traffic that suddenly flushes the hidden +255 percpu counter to the 
global one and another percpu counter still has a -254 for example.

> +
> +	return *memory_allocated;
> +}
> +
> +static long tcp_get_memory_allocated(struct test_case *test_case, struct sk_memcg *skel)
> +{
> +	return get_memory_allocated(test_case,
> +				    &skel->bss->tcp_activated,
> +				    &skel->bss->tcp_stable,
> +				    &skel->bss->tcp_memory_allocated);
> +}
> +
> +static long udp_get_memory_allocated(struct test_case *test_case, struct sk_memcg *skel)
> +{
> +	return get_memory_allocated(test_case,
> +				    &skel->bss->udp_activated,
> +				    &skel->bss->udp_stable,
> +				    &skel->bss->udp_memory_allocated);
> +}
> +
> +static int check_exclusive(struct test_case *test_case,
> +			   struct sk_memcg *skel, bool exclusive)
> +{
> +	char buf[BUF_SINGLE] = {};
> +	long memory_allocated[2];
> +	int sk[NR_SOCKETS] = {};
> +	int err, i, j;
> +
> +	err = test_case->create_sockets(test_case, sk, ARRAY_SIZE(sk));
> +	if (err)
> +		goto close;
> +
> +	memory_allocated[0] = test_case->get_memory_allocated(test_case, skel);
> +
> +	/* allocate pages >= 1024 */
> +	for (i = 0; i < ARRAY_SIZE(sk); i++) {
> +		for (j = 0; j < NR_SEND; j++) {
> +			int bytes = send(sk[i], buf, sizeof(buf), 0);
> +
> +			/* Avoid too noisy logs when something failed. */
> +			if (bytes != sizeof(buf)) {
> +				ASSERT_EQ(bytes, sizeof(buf), "send");
> +				if (bytes < 0) {
> +					err = bytes;
> +					goto close;
> +				}
> +			}
> +		}
> +	}
> +
> +	memory_allocated[1] = test_case->get_memory_allocated(test_case, skel);
> +
> +	if (exclusive)
> +		ASSERT_LE(memory_allocated[1], memory_allocated[0] + 10, "exclusive");
> +	else
> +		ASSERT_GT(memory_allocated[1], memory_allocated[0] + 1024, "not exclusive");

The test is taking >10s in my environment. Although it has kasan and other dbg 
turned on, my environment is not a slow one tbh. The WATCHDOG > 10s warning is 
hit pretty often. The exclusive case is expecting +10. Maybe we just need to 
check +128 for non-exclusive, which should still be enough to contrast with the 
exclusive case? With +128, NR_SEND 32 is more than enough?
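
i.e. something like (untested, numbers just for illustration):

	ASSERT_GT(memory_allocated[1], memory_allocated[0] + 128, "not exclusive");

with NR_SEND dropped to 32.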

> +
> +close:
> +	for (i = 0; i < ARRAY_SIZE(sk); i++)
> +		close(sk[i]);
> +
> +	if (test_case->type == SOCK_DGRAM) {
> +		/* UDP recv queue is destroyed after RCU grace period.
> +		 * With one kern_sync_rcu(), memory_allocated[0] of the
> +		 * exclusive case often matches with memory_allocated[1]
> +		 * of the preceding non-exclusive case.
> +		 */

I don't think I understand the double kern_sync_rcu() below.

> +		kern_sync_rcu();
> +		kern_sync_rcu();
> +	}
> +
> +	return err;
> +}
> +
> +void run_test(struct test_case *test_case)

static

> +{
> +	struct nstoken *nstoken;
> +	struct sk_memcg *skel;
> +	int cgroup, err;
> +
> +	skel = sk_memcg__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "open_and_load"))
> +		return;
> +
> +	skel->bss->nr_cpus = libbpf_num_possible_cpus();
> +
> +	err = sk_memcg__attach(skel);
> +	if (!ASSERT_OK(err, "attach"))
> +		goto destroy_skel;
> +
> +	cgroup = test__join_cgroup("/sk_memcg");
> +	if (!ASSERT_GE(cgroup, 0, "join_cgroup"))
> +		goto destroy_skel;
> +
> +	err = make_netns("sk_memcg");
> +	if (!ASSERT_EQ(err, 0, "make_netns"))
> +		goto close_cgroup;
> +
> +	nstoken = open_netns("sk_memcg");
> +	if (!ASSERT_OK_PTR(nstoken, "open_netns"))
> +		goto remove_netns;
> +
> +	err = check_exclusive(test_case, skel, false);
> +	if (!ASSERT_EQ(err, 0, "test_exclusive(false)"))
> +		goto close_netns;
> +
> +	err = write_sysctl("/proc/sys/net/core/memcg_exclusive", "1");
> +	if (!ASSERT_EQ(err, 0, "write_sysctl(1)"))
> +		goto close_netns;
> +
> +	err = check_exclusive(test_case, skel, true);
> +	if (!ASSERT_EQ(err, 0, "test_exclusive(true by sysctl)"))
> +		goto close_netns;
> +
> +	err = write_sysctl("/proc/sys/net/core/memcg_exclusive", "0");
> +	if (!ASSERT_EQ(err, 0, "write_sysctl(0)"))
> +		goto close_netns;
> +
> +	skel->links.sock_create = bpf_program__attach_cgroup(skel->progs.sock_create, cgroup);
> +	if (!ASSERT_OK_PTR(skel->links.sock_create, "attach_cgroup(sock_create)"))
> +		goto close_netns;
> +
> +	err = check_exclusive(test_case, skel, true);
> +	ASSERT_EQ(err, 0, "test_exclusive(true by bpf)");
> +
> +close_netns:
> +	close_netns(nstoken);
> +remove_netns:
> +	remove_netns("sk_memcg");
> +close_cgroup:
> +	close(cgroup);
> +destroy_skel:
> +	sk_memcg__destroy(skel);
> +}
> +
> +struct test_case test_cases[] = {
> +	{
> +		.name = "TCP  ",
> +		.family = AF_INET,
> +		.type = SOCK_STREAM,
> +		.create_sockets = tcp_create_sockets,
> +		.get_memory_allocated = tcp_get_memory_allocated,
> +	},
> +	{
> +		.name = "UDP  ",
> +		.family = AF_INET,
> +		.type = SOCK_DGRAM,
> +		.create_sockets = udp_create_sockets,
> +		.get_memory_allocated = udp_get_memory_allocated,
> +	},
> +	{
> +		.name = "TCPv6",
> +		.family = AF_INET6,
> +		.type = SOCK_STREAM,
> +		.create_sockets = tcp_create_sockets,
> +		.get_memory_allocated = tcp_get_memory_allocated,
> +	},
> +	{
> +		.name = "UDPv6",
> +		.family = AF_INET6,
> +		.type = SOCK_DGRAM,
> +		.create_sockets = udp_create_sockets,
> +		.get_memory_allocated = udp_get_memory_allocated,
> +	},
> +};
> +
> +void serial_test_sk_memcg(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
> +		test__start_subtest(test_cases[i].name);

This is not doing anything without "if".
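
i.e. something like:

		if (test__start_subtest(test_cases[i].name))
			run_test(&test_cases[i]);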

> +		run_test(&test_cases[i]);
> +	}
> +}
> diff --git a/tools/testing/selftests/bpf/progs/sk_memcg.c b/tools/testing/selftests/bpf/progs/sk_memcg.c
> new file mode 100644
> index 000000000000..6b1a928a0c90
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/sk_memcg.c
> @@ -0,0 +1,146 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright 2025 Google LLC */
> +
> +#include "bpf_tracing_net.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include <errno.h>
> +
> +extern int tcp_memory_per_cpu_fw_alloc __ksym;
> +extern int udp_memory_per_cpu_fw_alloc __ksym;
> +
> +int nr_cpus;
> +bool tcp_activated, tcp_stable, udp_activated, udp_stable;
> +long tcp_memory_allocated, udp_memory_allocated;
> +static struct sock *tcp_sk_tracing, *udp_sk_tracing;
> +
> +struct sk_prot {
> +	long *memory_allocated;
> +	int *memory_per_cpu_fw_alloc;
> +};
> +
> +static int drain_memory_per_cpu_fw_alloc(__u32 i, struct sk_prot *sk_prot_ctx)
> +{
> +	int *memory_per_cpu_fw_alloc;
> +
> +	memory_per_cpu_fw_alloc = bpf_per_cpu_ptr(sk_prot_ctx->memory_per_cpu_fw_alloc, i);
> +	if (memory_per_cpu_fw_alloc)
> +		*sk_prot_ctx->memory_allocated += *memory_per_cpu_fw_alloc;
> +
> +	return 0;
> +}
> +
> +static long get_memory_allocated(struct sock *_sk, int *memory_per_cpu_fw_alloc)
> +{
> +	struct sock *sk = bpf_core_cast(_sk, struct sock);
> +	struct sk_prot sk_prot_ctx;
> +	long memory_allocated;
> +
> +	/* net_aligned_data.{tcp,udp}_memory_allocated was not available. */
> +	memory_allocated = sk->__sk_common.skc_prot->memory_allocated->counter;
> +
> +	sk_prot_ctx.memory_allocated = &memory_allocated;
> +	sk_prot_ctx.memory_per_cpu_fw_alloc = memory_per_cpu_fw_alloc;
> +
> +	bpf_loop(nr_cpus, drain_memory_per_cpu_fw_alloc, &sk_prot_ctx, 0);
> +
> +	return memory_allocated;
> +}
> +
> +static void fentry_init_sock(struct sock *sk, struct sock **sk_tracing,
> +			     long *memory_allocated, int *memory_per_cpu_fw_alloc,
> +			     bool *activated)
> +{
> +	if (!*activated)
> +		return;
> +
> +	if (__sync_val_compare_and_swap(sk_tracing, NULL, sk))
> +		return;
> +
> +	*activated = false;
> +	*memory_allocated = get_memory_allocated(sk, memory_per_cpu_fw_alloc);
> +}
> +
> +static void fexit_init_sock(struct sock *sk, struct sock **sk_tracing,
> +			    long *memory_allocated, int *memory_per_cpu_fw_alloc,
> +			    bool *stable)
> +{
> +	long new_memory_allocated;
> +
> +	if (sk != *sk_tracing)
> +		return;
> +
> +	new_memory_allocated = get_memory_allocated(sk, memory_per_cpu_fw_alloc);
> +	if (new_memory_allocated == *memory_allocated)
> +		*stable = true;

I am not sure that helps. The total memory_allocated can still change after this. 
I would just grab the total in fentry once and then move on without confirming 
in fexit.
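
Roughly (untested), keeping only the fentry side:

	static void fentry_init_sock(struct sock *sk, long *memory_allocated,
				     int *memory_per_cpu_fw_alloc, bool *activated)
	{
		if (!*activated)
			return;

		*activated = false;
		*memory_allocated = get_memory_allocated(sk, memory_per_cpu_fw_alloc);
	}

and then the fexit progs and the sk_tracing bookkeeping can go away.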

> +
> +	*sk_tracing = NULL;
> +}
> +
> +SEC("fentry/tcp_init_sock")
> +int BPF_PROG(fentry_tcp_init_sock, struct sock *sk)
> +{
> +	fentry_init_sock(sk, &tcp_sk_tracing,
> +			 &tcp_memory_allocated, &tcp_memory_per_cpu_fw_alloc,
> +			 &tcp_activated);
> +	return 0;
> +}
> +
> +SEC("fexit/tcp_init_sock")
> +int BPF_PROG(fexit_tcp_init_sock, struct sock *sk)
> +{
> +	fexit_init_sock(sk, &tcp_sk_tracing,
> +			&tcp_memory_allocated, &tcp_memory_per_cpu_fw_alloc,
> +			&tcp_stable);
> +	return 0;
> +}
> +
> +SEC("fentry/udp_init_sock")
> +int BPF_PROG(fentry_udp_init_sock, struct sock *sk)
> +{
> +	fentry_init_sock(sk, &udp_sk_tracing,
> +			 &udp_memory_allocated, &udp_memory_per_cpu_fw_alloc,
> +			 &udp_activated);
> +	return 0;
> +}
> +
> +SEC("fexit/udp_init_sock")
> +int BPF_PROG(fexit_udp_init_sock, struct sock *sk)
> +{
> +	fexit_init_sock(sk, &udp_sk_tracing,
> +			&udp_memory_allocated, &udp_memory_per_cpu_fw_alloc,
> +			&udp_stable);
> +	return 0;
> +}
> +
> +SEC("cgroup/sock_create")
> +int sock_create(struct bpf_sock *ctx)
> +{
> +	u32 flags = SK_BPF_MEMCG_EXCLUSIVE;
> +	int err;
> +
> +	err = bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
> +			     &flags, sizeof(flags));
> +	if (err)
> +		goto err;
> +
> +	flags = 0;
> +
> +	err = bpf_getsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
> +			     &flags, sizeof(flags));
> +	if (err)
> +		goto err;
> +
> +	if (flags != SK_BPF_MEMCG_EXCLUSIVE) {
> +		err = -EINVAL;
> +		goto err;
> +	}
> +
> +	return 1;
> +
> +err:
> +	bpf_set_retval(err);
> +	return 0;
> +}
> +
> +char LICENSE[] SEC("license") = "GPL";



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v9 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE.
  2025-09-17 23:38   ` Martin KaFai Lau
@ 2025-09-18  1:17     ` Kuniyuki Iwashima
  2025-09-19  1:14       ` Martin KaFai Lau
  0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-18  1:17 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, bpf, netdev

On Wed, Sep 17, 2025 at 4:38 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 9/17/25 12:14 PM, Kuniyuki Iwashima wrote:
> > The test does the following for IPv4/IPv6 x TCP/UDP sockets
> > with/without SK_MEMCG_EXCLUSIVE, which can be turned on by
> > net.core.memcg_exclusive or bpf_setsockopt(SK_BPF_MEMCG_EXCLUSIVE).
> >
> >    1. Create socket pairs
> >    2. Send a bunch of data that requires more than 1024 pages
> >    3. Read memory_allocated from sk->sk_prot->memory_allocated and
> >       sk->sk_prot->memory_per_cpu_fw_alloc
> >    4. Check if unread data is charged to memory_allocated
> >
> > If SK_MEMCG_EXCLUSIVE is set, memory_allocated should not be
> > changed, but we allow a small error (up to 10 pages) in case
> > other processes on the host use some amount of TCP/UDP memory.
> >
> > The number of allocated pages is buffered in the per-cpu variable
> > {tcp,udp}_memory_per_cpu_fw_alloc up to +/- net.core.mem_pcpu_rsv
> > before being reported to {tcp,udp}_memory_allocated.
> >
> > At 3., memory_allocated is calculated from the two variables twice,
> > at fentry and fexit of the socket create function, to check whether the
> > per-cpu value was drained during the calculation.  In that case, 3. is retried.
> >
> > We use kern_sync_rcu() for UDP because UDP recv queue is destroyed
> > after RCU grace period.
> >
> > The test takes ~2s on QEMU (64 CPUs) w/ KVM but takes 6s w/o KVM.
> >
> >    # time ./test_progs -t sk_memcg
> >    #370/1   sk_memcg/TCP  :OK
> >    #370/2   sk_memcg/UDP  :OK
> >    #370/3   sk_memcg/TCPv6:OK
> >    #370/4   sk_memcg/UDPv6:OK
> >    #370     sk_memcg:OK
> >    Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED
> >
> >    real       0m1.623s
> >    user       0m0.165s
> >    sys        0m0.366s
> >
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> > ---
> > v7:
> >    * Add test for sysctl
> >
> > v6:
> >    * Trace sk_prot->memory_allocated + sk_prot->memory_per_cpu_fw_alloc
> >
> > v5:
> >    * Use kern_sync_rcu()
> >    * Double NR_SEND to 128
> >
> > v4:
> >    * Only use inet_create() hook
> >    * Test bpf_getsockopt()
> >    * Add serial_ prefix
> >    * Reduce sleep() and the amount of sent data
> > ---
> >   .../selftests/bpf/prog_tests/sk_memcg.c       | 261 ++++++++++++++++++
> >   tools/testing/selftests/bpf/progs/sk_memcg.c  | 146 ++++++++++
> >   2 files changed, 407 insertions(+)
> >   create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_memcg.c
> >   create mode 100644 tools/testing/selftests/bpf/progs/sk_memcg.c
> >
> > diff --git a/tools/testing/selftests/bpf/prog_tests/sk_memcg.c b/tools/testing/selftests/bpf/prog_tests/sk_memcg.c
> > new file mode 100644
> > index 000000000000..777fb81e9365
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/sk_memcg.c
> > @@ -0,0 +1,261 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright 2025 Google LLC */
> > +
> > +#include <test_progs.h>
> > +#include "sk_memcg.skel.h"
> > +#include "network_helpers.h"
> > +
> > +#define NR_SOCKETS   64
> > +#define NR_SEND              128
> > +#define BUF_SINGLE   1024
> > +#define BUF_TOTAL    (BUF_SINGLE * NR_SEND)
> > +
> > +struct test_case {
> > +     char name[8];
> > +     int family;
> > +     int type;
> > +     int (*create_sockets)(struct test_case *test_case, int sk[], int len);
> > +     long (*get_memory_allocated)(struct test_case *test_case, struct sk_memcg *skel);
> > +};
> > +
> > +static int tcp_create_sockets(struct test_case *test_case, int sk[], int len)
> > +{
> > +     int server, i;
> > +
> > +     server = start_server(test_case->family, test_case->type, NULL, 0, 0);
> > +     ASSERT_GE(server, 0, "start_server_str");
> > +
> > +     for (i = 0; i < len / 2; i++) {
> > +             sk[i * 2] = connect_to_fd(server, 0);
> > +             if (!ASSERT_GE(sk[i * 2], 0, "connect_to_fd"))
> > +                     return sk[i * 2];
> > +
> > +             sk[i * 2 + 1] = accept(server, NULL, NULL);
> > +             if (!ASSERT_GE(sk[i * 2 + 1], 0, "accept"))
> > +                     return sk[i * 2 + 1];
> > +     }
> > +
> > +     close(server);
> > +
> > +     return 0;
> > +}
> > +
> > +static int udp_create_sockets(struct test_case *test_case, int sk[], int len)
> > +{
> > +     int i, err, rcvbuf = BUF_TOTAL;
> > +
> > +     for (i = 0; i < len / 2; i++) {
>
> nit. How about "for (i = 0; i < len; i += 2) {" once here instead of "i * 2"
> below. Same for the tcp_create_sockets() above.

Will do.


>
> > +             sk[i * 2] = start_server(test_case->family, test_case->type, NULL, 0, 0);
> > +             if (!ASSERT_GE(sk[i * 2], 0, "start_server"))
> > +                     return sk[i * 2];
> > +
> > +             sk[i * 2 + 1] = connect_to_fd(sk[i * 2], 0);
> > +             if (!ASSERT_GE(sk[i * 2 + 1], 0, "connect_to_fd"))
> > +                     return sk[i * 2 + 1];
> > +
> > +             err = connect_fd_to_fd(sk[i * 2], sk[i * 2 + 1], 0);
> > +             if (!ASSERT_EQ(err, 0, "connect_fd_to_fd"))
> > +                     return err;
> > +
> > +             err = setsockopt(sk[i * 2], SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(int));
> > +             if (!ASSERT_EQ(err, 0, "setsockopt(SO_RCVBUF)"))
> > +                     return err;
> > +
> > +             err = setsockopt(sk[i * 2 + 1], SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(int));
> > +             if (!ASSERT_EQ(err, 0, "setsockopt(SO_RCVBUF)"))
> > +                     return err;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +static long get_memory_allocated(struct test_case *test_case,
> > +                              bool *activated, bool *stable,
> > +                              long *memory_allocated)
> > +{
> > +     *stable = false;
> > +
> > +     do {
> > +             *activated = true;
> > +
> > +             /* AF_INET and AF_INET6 share the same memory_allocated.
> > +              * tcp_init_sock() is called by AF_INET and AF_INET6,
> > +              * but udp_lib_init_sock() is inline.
> > +              */
> > +             socket(AF_INET, test_case->type, 0);
>
> fd is leaked.

Will close().


>
> > +     } while (!*stable);
>
> cannot loop forever. The test needs to assume the machine is relatively network
> quiet anyway (so serial_). Things can still change after the stable test also. I
> think having a way (the fentry in the progs/sk_memcg.c) to account for the
> percpu fw alloc is good enough, and this should help if there is some light
> background traffic that suddenly flushes the hidden +255 percpu counter to the
> global one and another percpu counter still has a -254 for example.

Okay, I'll remove fexit progs.


>
> > +
> > +     return *memory_allocated;
> > +}
> > +
> > +static long tcp_get_memory_allocated(struct test_case *test_case, struct sk_memcg *skel)
> > +{
> > +     return get_memory_allocated(test_case,
> > +                                 &skel->bss->tcp_activated,
> > +                                 &skel->bss->tcp_stable,
> > +                                 &skel->bss->tcp_memory_allocated);
> > +}
> > +
> > +static long udp_get_memory_allocated(struct test_case *test_case, struct sk_memcg *skel)
> > +{
> > +     return get_memory_allocated(test_case,
> > +                                 &skel->bss->udp_activated,
> > +                                 &skel->bss->udp_stable,
> > +                                 &skel->bss->udp_memory_allocated);
> > +}
> > +
> > +static int check_exclusive(struct test_case *test_case,
> > +                        struct sk_memcg *skel, bool exclusive)
> > +{
> > +     char buf[BUF_SINGLE] = {};
> > +     long memory_allocated[2];
> > +     int sk[NR_SOCKETS] = {};
> > +     int err, i, j;
> > +
> > +     err = test_case->create_sockets(test_case, sk, ARRAY_SIZE(sk));
> > +     if (err)
> > +             goto close;
> > +
> > +     memory_allocated[0] = test_case->get_memory_allocated(test_case, skel);
> > +
> > +     /* allocate pages >= 1024 */
> > +     for (i = 0; i < ARRAY_SIZE(sk); i++) {
> > +             for (j = 0; j < NR_SEND; j++) {
> > +                     int bytes = send(sk[i], buf, sizeof(buf), 0);
> > +
> > +                     /* Avoid too noisy logs when something failed. */
> > +                     if (bytes != sizeof(buf)) {
> > +                             ASSERT_EQ(bytes, sizeof(buf), "send");
> > +                             if (bytes < 0) {
> > +                                     err = bytes;
> > +                                     goto close;
> > +                             }
> > +                     }
> > +             }
> > +     }
> > +
> > +     memory_allocated[1] = test_case->get_memory_allocated(test_case, skel);
> > +
> > +     if (exclusive)
> > +             ASSERT_LE(memory_allocated[1], memory_allocated[0] + 10, "exclusive");
> > +     else
> > +             ASSERT_GT(memory_allocated[1], memory_allocated[0] + 1024, "not exclusive");
>
> The test is taking >10s in my environment. Although it has kasan and other dbg
> turned on, my environment is not a slow one tbh. The WATCHDOG > 10s warning is
> hit pretty often. The exclusive case is expecting +10. Maybe we just need to
> check +128 for non-exclusive, which should still be enough to contrast with the
> exclusive case? With +128, NR_SEND 32 is more than enough?

I think I was too conservative, and the number should be
fine with fentry progs.


>
> > +
> > +close:
> > +     for (i = 0; i < ARRAY_SIZE(sk); i++)
> > +             close(sk[i]);
> > +
> > +     if (test_case->type == SOCK_DGRAM) {
> > +             /* UDP recv queue is destroyed after RCU grace period.
> > +              * With one kern_sync_rcu(), memory_allocated[0] of the
> > +              * exclusive case often matches with memory_allocated[1]
> > +              * of the preceding non-exclusive case.
> > +              */
>
> I don't think I understand the double kern_sync_rcu() below.

With one kern_sync_rcu(), when I added bpf_printk() for memory_allocated,
I sometimes saw two consecutive non-zero values, meaning memory_allocated[0]
still see the previous test case result (memory_allocated[1]).
ASSERT_LE() succeeds as expected, but somewhat unintentionally.

bpf_trace_printk: memory_allocated: 0 <-- non exclusive case
bpf_trace_printk: memory_allocated: 4160
bpf_trace_printk: memory_allocated: 4160 <-- exclusive case's
memory_allocated[0]
bpf_trace_printk: memory_allocated: 0
bpf_trace_printk: memory_allocated: 0
bpf_trace_printk: memory_allocated: 0

One kern_sync_rcu() is enough to kick call_rcu() + sk_destruct() but
does not guarantee that it completes, so if the queue length was too long,
the memory_allocated does not go down fast enough.

But now I don't see the flakiness with NR_SEND 32, and one
kern_sync_rcu() might be enough unless the env is too slow...?



>
> > +             kern_sync_rcu();
> > +             kern_sync_rcu();
> > +     }
> > +
> > +     return err;
> > +}
> > +
> > +void run_test(struct test_case *test_case)
>
> static

Will add.


>
> > +{
> > +     struct nstoken *nstoken;
> > +     struct sk_memcg *skel;
> > +     int cgroup, err;
> > +
> > +     skel = sk_memcg__open_and_load();
> > +     if (!ASSERT_OK_PTR(skel, "open_and_load"))
> > +             return;
> > +
> > +     skel->bss->nr_cpus = libbpf_num_possible_cpus();
> > +
> > +     err = sk_memcg__attach(skel);
> > +     if (!ASSERT_OK(err, "attach"))
> > +             goto destroy_skel;
> > +
> > +     cgroup = test__join_cgroup("/sk_memcg");
> > +     if (!ASSERT_GE(cgroup, 0, "join_cgroup"))
> > +             goto destroy_skel;
> > +
> > +     err = make_netns("sk_memcg");
> > +     if (!ASSERT_EQ(err, 0, "make_netns"))
> > +             goto close_cgroup;
> > +
> > +     nstoken = open_netns("sk_memcg");
> > +     if (!ASSERT_OK_PTR(nstoken, "open_netns"))
> > +             goto remove_netns;
> > +
> > +     err = check_exclusive(test_case, skel, false);
> > +     if (!ASSERT_EQ(err, 0, "test_exclusive(false)"))
> > +             goto close_netns;
> > +
> > +     err = write_sysctl("/proc/sys/net/core/memcg_exclusive", "1");
> > +     if (!ASSERT_EQ(err, 0, "write_sysctl(1)"))
> > +             goto close_netns;
> > +
> > +     err = check_exclusive(test_case, skel, true);
> > +     if (!ASSERT_EQ(err, 0, "test_exclusive(true by sysctl)"))
> > +             goto close_netns;
> > +
> > +     err = write_sysctl("/proc/sys/net/core/memcg_exclusive", "0");
> > +     if (!ASSERT_EQ(err, 0, "write_sysctl(0)"))
> > +             goto close_netns;
> > +
> > +     skel->links.sock_create = bpf_program__attach_cgroup(skel->progs.sock_create, cgroup);
> > +     if (!ASSERT_OK_PTR(skel->links.sock_create, "attach_cgroup(sock_create)"))
> > +             goto close_netns;
> > +
> > +     err = check_exclusive(test_case, skel, true);
> > +     ASSERT_EQ(err, 0, "test_exclusive(true by bpf)");
> > +
> > +close_netns:
> > +     close_netns(nstoken);
> > +remove_netns:
> > +     remove_netns("sk_memcg");
> > +close_cgroup:
> > +     close(cgroup);
> > +destroy_skel:
> > +     sk_memcg__destroy(skel);
> > +}
> > +
> > +struct test_case test_cases[] = {
> > +     {
> > +             .name = "TCP  ",
> > +             .family = AF_INET,
> > +             .type = SOCK_STREAM,
> > +             .create_sockets = tcp_create_sockets,
> > +             .get_memory_allocated = tcp_get_memory_allocated,
> > +     },
> > +     {
> > +             .name = "UDP  ",
> > +             .family = AF_INET,
> > +             .type = SOCK_DGRAM,
> > +             .create_sockets = udp_create_sockets,
> > +             .get_memory_allocated = udp_get_memory_allocated,
> > +     },
> > +     {
> > +             .name = "TCPv6",
> > +             .family = AF_INET6,
> > +             .type = SOCK_STREAM,
> > +             .create_sockets = tcp_create_sockets,
> > +             .get_memory_allocated = tcp_get_memory_allocated,
> > +     },
> > +     {
> > +             .name = "UDPv6",
> > +             .family = AF_INET6,
> > +             .type = SOCK_DGRAM,
> > +             .create_sockets = udp_create_sockets,
> > +             .get_memory_allocated = udp_get_memory_allocated,
> > +     },
> > +};
> > +
> > +void serial_test_sk_memcg(void)
> > +{
> > +     int i;
> > +
> > +     for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
> > +             test__start_subtest(test_cases[i].name);
>
> This is not doing anything without "if".

Ah, will move run_test() inside the if.

>
> > +             run_test(&test_cases[i]);
> > +     }
> > +}
> > diff --git a/tools/testing/selftests/bpf/progs/sk_memcg.c b/tools/testing/selftests/bpf/progs/sk_memcg.c
> > new file mode 100644
> > index 000000000000..6b1a928a0c90
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/sk_memcg.c
> > @@ -0,0 +1,146 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright 2025 Google LLC */
> > +
> > +#include "bpf_tracing_net.h"
> > +#include <bpf/bpf_helpers.h>
> > +#include <bpf/bpf_tracing.h>
> > +#include <errno.h>
> > +
> > +extern int tcp_memory_per_cpu_fw_alloc __ksym;
> > +extern int udp_memory_per_cpu_fw_alloc __ksym;
> > +
> > +int nr_cpus;
> > +bool tcp_activated, tcp_stable, udp_activated, udp_stable;
> > +long tcp_memory_allocated, udp_memory_allocated;
> > +static struct sock *tcp_sk_tracing, *udp_sk_tracing;
> > +
> > +struct sk_prot {
> > +     long *memory_allocated;
> > +     int *memory_per_cpu_fw_alloc;
> > +};
> > +
> > +static int drain_memory_per_cpu_fw_alloc(__u32 i, struct sk_prot *sk_prot_ctx)
> > +{
> > +     int *memory_per_cpu_fw_alloc;
> > +
> > +     memory_per_cpu_fw_alloc = bpf_per_cpu_ptr(sk_prot_ctx->memory_per_cpu_fw_alloc, i);
> > +     if (memory_per_cpu_fw_alloc)
> > +             *sk_prot_ctx->memory_allocated += *memory_per_cpu_fw_alloc;
> > +
> > +     return 0;
> > +}
> > +
> > +static long get_memory_allocated(struct sock *_sk, int *memory_per_cpu_fw_alloc)
> > +{
> > +     struct sock *sk = bpf_core_cast(_sk, struct sock);
> > +     struct sk_prot sk_prot_ctx;
> > +     long memory_allocated;
> > +
> > +     /* net_aligned_data.{tcp,udp}_memory_allocated was not available. */
> > +     memory_allocated = sk->__sk_common.skc_prot->memory_allocated->counter;
> > +
> > +     sk_prot_ctx.memory_allocated = &memory_allocated;
> > +     sk_prot_ctx.memory_per_cpu_fw_alloc = memory_per_cpu_fw_alloc;
> > +
> > +     bpf_loop(nr_cpus, drain_memory_per_cpu_fw_alloc, &sk_prot_ctx, 0);
> > +
> > +     return memory_allocated;
> > +}
> > +
> > +static void fentry_init_sock(struct sock *sk, struct sock **sk_tracing,
> > +                          long *memory_allocated, int *memory_per_cpu_fw_alloc,
> > +                          bool *activated)
> > +{
> > +     if (!*activated)
> > +             return;
> > +
> > +     if (__sync_val_compare_and_swap(sk_tracing, NULL, sk))
> > +             return;
> > +
> > +     *activated = false;
> > +     *memory_allocated = get_memory_allocated(sk, memory_per_cpu_fw_alloc);
> > +}
> > +
> > +static void fexit_init_sock(struct sock *sk, struct sock **sk_tracing,
> > +                         long *memory_allocated, int *memory_per_cpu_fw_alloc,
> > +                         bool *stable)
> > +{
> > +     long new_memory_allocated;
> > +
> > +     if (sk != *sk_tracing)
> > +             return;
> > +
> > +     new_memory_allocated = get_memory_allocated(sk, memory_per_cpu_fw_alloc);
> > +     if (new_memory_allocated == *memory_allocated)
> > +             *stable = true;
>
> I am not sure that helps. The total memory_allocated can still change after this.
> I would just grab the total in fentry once and then move on without confirming
> in fexit.

Will remove fexit stuff.

Thanks!

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v9 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE.
  2025-09-18  1:17     ` Kuniyuki Iwashima
@ 2025-09-19  1:14       ` Martin KaFai Lau
  2025-09-19  2:28         ` Kuniyuki Iwashima
  0 siblings, 1 reply; 12+ messages in thread
From: Martin KaFai Lau @ 2025-09-19  1:14 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, bpf, netdev

On 9/17/25 6:17 PM, Kuniyuki Iwashima wrote:
>>> +
>>> +close:
>>> +     for (i = 0; i < ARRAY_SIZE(sk); i++)
>>> +             close(sk[i]);
>>> +
>>> +     if (test_case->type == SOCK_DGRAM) {
>>> +             /* UDP recv queue is destroyed after RCU grace period.
>>> +              * With one kern_sync_rcu(), memory_allocated[0] of the
>>> +              * exclusive case often matches with memory_allocated[1]
>>> +              * of the preceding non-exclusive case.
>>> +              */
>> I don't think I understand the double kern_sync_rcu() below.
> With one kern_sync_rcu(), when I added bpf_printk() for memory_allocated,
> I sometimes saw two consecutive non-zero values, meaning memory_allocated[0]
> still see the previous test case result (memory_allocated[1]).
> ASSERT_LE() succeeds as expected, but somewhat unintentionally.
> 
> bpf_trace_printk: memory_allocated: 0 <-- non exclusive case
> bpf_trace_printk: memory_allocated: 4160
> bpf_trace_printk: memory_allocated: 4160 <-- exclusive case's
> memory_allocated[0]
> bpf_trace_printk: memory_allocated: 0
> bpf_trace_printk: memory_allocated: 0
> bpf_trace_printk: memory_allocated: 0
> 
> One kern_sync_rcu() is enough to kick call_rcu() + sk_destruct() but
> does not guarantee that it completes, so if the queue length was too long,
> the memory_allocated does not go down fast enough.
> 
> But now I don't see the flakiness with NR_SEND 32, and one
> kern_sync_rcu() might be enough unless the env is too slow...?

Ah, got it. I put you on the wrong path. It needs rcu_barrier() instead.

Is recv() enough? Maybe just recv(MSG_DONTWAIT) to consume it only for the UDP 
socket? That will slow down the udp test... only read 1 byte and the remaining 
can be MSG_TRUNC?
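
Something like this (untested) before the close() loop, only for the UDP case:

	if (test_case->type == SOCK_DGRAM) {
		char byte;

		for (i = 0; i < ARRAY_SIZE(sk); i++)
			while (recv(sk[i], &byte, 1, MSG_DONTWAIT | MSG_TRUNC) > 0)
				;
	}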

btw, does the test need 64 sockets? is it because of the per socket snd/rcvbuf 
limitation?

Another option is to trace SEC("fexit/__sk_destruct") to ensure all the cleanup 
is done but seems overkill if recv() can do.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v9 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE.
  2025-09-19  1:14       ` Martin KaFai Lau
@ 2025-09-19  2:28         ` Kuniyuki Iwashima
  2025-09-19  2:43           ` Kuniyuki Iwashima
  0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-19  2:28 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, bpf, netdev

On Thu, Sep 18, 2025 at 6:14 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 9/17/25 6:17 PM, Kuniyuki Iwashima wrote:
> >>> +
> >>> +close:
> >>> +     for (i = 0; i < ARRAY_SIZE(sk); i++)
> >>> +             close(sk[i]);
> >>> +
> >>> +     if (test_case->type == SOCK_DGRAM) {
> >>> +             /* UDP recv queue is destroyed after RCU grace period.
> >>> +              * With one kern_sync_rcu(), memory_allocated[0] of the
> >>> +              * exclusive case often matches with memory_allocated[1]
> >>> +              * of the preceding non-exclusive case.
> >>> +              */
> >> I don't think I understand the double kern_sync_rcu() below.
> > With one kern_sync_rcu(), when I added bpf_printk() for memory_allocated,
> > I sometimes saw two consecutive non-zero values, meaning memory_allocated[0]
> > still see the previous test case result (memory_allocated[1]).
> > ASSERT_LE() succeeds as expected, but somewhat unintentionally.
> >
> > bpf_trace_printk: memory_allocated: 0 <-- non exclusive case
> > bpf_trace_printk: memory_allocated: 4160
> > bpf_trace_printk: memory_allocated: 4160 <-- exclusive case's
> > memory_allocated[0]
> > bpf_trace_printk: memory_allocated: 0
> > bpf_trace_printk: memory_allocated: 0
> > bpf_trace_printk: memory_allocated: 0
> >
> > One kern_sync_rcu() is enough to kick call_rcu() + sk_destruct() but
> > does not guarantee that it completes, so if the queue length was too long,
> > the memory_allocated does not go down fast enough.
> >
> > But now I don't see the flakiness with NR_SEND 32, and one
> > kern_sync_rcu() might be enough unless the env is too slow...?
>
> Ah, got it. I put you in the wrong path. It needs rcu_barrier() instead.
>
> Is recv() enough? May be just recv(MSG_DONTWAIT) to consume it only for UDP
> socket? that will slow down the udp test... only read 1 byte and the remaining
> can be MSG_TRUNC?

recv() before close() seems to work with fewer sockets (with the same
bytes of send()s and without kern_sync_rcu()).

And the test time was mostly the same as in the "no recv() + 1 kern_sync_rcu()"
case (2~2.5s on my qemu), so draining seems better.

#define NR_PAGES        128
#define NR_SOCKETS      2
#define BUF_TOTAL       (NR_PAGES * 4096 / (NR_SOCKETS / 2))
#define BUF_SINGLE      1024
#define NR_SEND         (BUF_TOTAL / BUF_SINGLE)

NR_SOCKET==64
      test_progs-1380    [008] ...11  2121.731176: bpf_trace_printk: memory_allocated: 0
      test_progs-1380    [008] ...11  2121.737700: bpf_trace_printk: memory_allocated: 576
      test_progs-1380    [008] ...11  2121.743436: bpf_trace_printk: memory_allocated: 64
      test_progs-1380    [008] ...11  2121.749984: bpf_trace_printk: memory_allocated: 64
      test_progs-1380    [008] ...11  2121.755763: bpf_trace_printk: memory_allocated: 64
      test_progs-1380    [008] ...11  2121.762171: bpf_trace_printk: memory_allocated: 64

NR_SOCKET==2
      test_progs-1424    [009] ...11  2352.990520: bpf_trace_printk: memory_allocated: 0
      test_progs-1424    [009] ...11  2352.997395: bpf_trace_printk: memory_allocated: 514
      test_progs-1424    [009] ...11  2353.001910: bpf_trace_printk: memory_allocated: 2
      test_progs-1424    [009] ...11  2353.008947: bpf_trace_printk: memory_allocated: 2
      test_progs-1424    [009] ...11  2353.012988: bpf_trace_printk: memory_allocated: 2
      test_progs-1424    [009] ...11  2353.019988: bpf_trace_printk: memory_allocated: 0

With NR_PAGES set to 128, the actual number of allocated pages was around 300 for
TCP and 500 for UDP.  I reduced NR_PAGES to 64, then TCP consumed 160
pages and UDP consumed 250, but the test time didn't change.

>
> btw, does the test need 64 sockets? is it because of the per socket snd/rcvbuf
> limitation?

I chose 64 for UDP, which does not tune rcvbuf automatically, but I saw an error
when I increased the number of send() bytes and set SO_RCVBUF anyway,
so any number of sockets is fine now.

I'll use 2 sockets but will keep for-loops so that we can change NR_SOCKETS
easily when the test gets flaky.


>
> Another option is to trace SEC("fexit/__sk_destruct") to ensure all the cleanup
> is done but seems overkill if recv() can do.

Agreed, draining before close() would be enough.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v9 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE.
  2025-09-19  2:28         ` Kuniyuki Iwashima
@ 2025-09-19  2:43           ` Kuniyuki Iwashima
  0 siblings, 0 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-19  2:43 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	John Fastabend, Stanislav Fomichev, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Neal Cardwell, Willem de Bruijn,
	Mina Almasry, Kuniyuki Iwashima, bpf, netdev

On Thu, Sep 18, 2025 at 7:28 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> On Thu, Sep 18, 2025 at 6:14 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On 9/17/25 6:17 PM, Kuniyuki Iwashima wrote:
> > >>> +
> > >>> +close:
> > >>> +     for (i = 0; i < ARRAY_SIZE(sk); i++)
> > >>> +             close(sk[i]);
> > >>> +
> > >>> +     if (test_case->type == SOCK_DGRAM) {
> > >>> +             /* UDP recv queue is destroyed after RCU grace period.
> > >>> +              * With one kern_sync_rcu(), memory_allocated[0] of the
> > >>> +              * exclusive case often matches with memory_allocated[1]
> > >>> +              * of the preceding non-exclusive case.
> > >>> +              */
> > >> I don't think I understand the double kern_sync_rcu() below.
> > > With one kern_sync_rcu(), when I added bpf_printk() for memory_allocated,
> > > I sometimes saw two consecutive non-zero values, meaning memory_allocated[0]
> > > still see the previous test case result (memory_allocated[1]).
> > > ASSERT_LE() succeeds as expected, but somewhat unintentionally.
> > >
> > > bpf_trace_printk: memory_allocated: 0 <-- non exclusive case
> > > bpf_trace_printk: memory_allocated: 4160
> > > bpf_trace_printk: memory_allocated: 4160 <-- exclusive case's
> > > memory_allocated[0]
> > > bpf_trace_printk: memory_allocated: 0
> > > bpf_trace_printk: memory_allocated: 0
> > > bpf_trace_printk: memory_allocated: 0
> > >
> > > One kern_sync_rcu() is enough to kick call_rcu() + sk_destruct() but
> > > does not guarantee that it completes, so if the queue length was too long,
> > > the memory_allocated does not go down fast enough.
> > >
> > > But now I don't see the flakiness with NR_SEND 32, and one
> > > kern_sync_rcu() might be enough unless the env is too slow...?
> >
> > Ah, got it. I put you in the wrong path. It needs rcu_barrier() instead.
> >
> > Is recv() enough? May be just recv(MSG_DONTWAIT) to consume it only for UDP
> > socket? that will slow down the udp test... only read 1 byte and the remaining
> > can be MSG_TRUNC?
>
> recv() before close() seems to work with fewer sockets (with the same
> bytes of send()s and without kern_sync_rcu()).
>
> And the test time was mostly the same with "no recv() + 1 kern_sync_rcu()"
> case (2~2.5s on my qemu), so draining seems better.

Oh no, with NR_PAGES 128, "no recv() + 1 kern_sync_rcu()" didn't
show the leftover at all, regardless of NR_SOCKETS==2 or 64.
But this might be only on my setup, and recv() before close()
could be stable on any setup theoretically.

fwiw, I put the latest code here.
https://github.com/q2ven/linux/commits/514_tcp_decouple_memcg

>
> #define NR_PAGES        128
> #define NR_SOCKETS      2
> #define BUF_TOTAL       (NR_PAGES * 4096 / (NR_SOCKETS / 2))
> #define BUF_SINGLE      1024
> #define NR_SEND         (BUF_TOTAL / BUF_SINGLE)
>
> NR_SOCKET==64
>       test_progs-1380    [008] ...11  2121.731176: bpf_trace_printk: memory_allocated: 0
>       test_progs-1380    [008] ...11  2121.737700: bpf_trace_printk: memory_allocated: 576
>       test_progs-1380    [008] ...11  2121.743436: bpf_trace_printk: memory_allocated: 64
>       test_progs-1380    [008] ...11  2121.749984: bpf_trace_printk: memory_allocated: 64
>       test_progs-1380    [008] ...11  2121.755763: bpf_trace_printk: memory_allocated: 64
>       test_progs-1380    [008] ...11  2121.762171: bpf_trace_printk: memory_allocated: 64
>
> NR_SOCKET==2
>       test_progs-1424    [009] ...11  2352.990520: bpf_trace_printk: memory_allocated: 0
>       test_progs-1424    [009] ...11  2352.997395: bpf_trace_printk: memory_allocated: 514
>       test_progs-1424    [009] ...11  2353.001910: bpf_trace_printk: memory_allocated: 2
>       test_progs-1424    [009] ...11  2353.008947: bpf_trace_printk: memory_allocated: 2
>       test_progs-1424    [009] ...11  2353.012988: bpf_trace_printk: memory_allocated: 2
>       test_progs-1424    [009] ...11  2353.019988: bpf_trace_printk: memory_allocated: 0
>
> While NR_PAGES sets to 128, the actual allocated page was around 300 for
> TCP and 500 for UDP.  I reduced NR_PAGES to 64, then TCP consumed 160
> pages, and UDP consumed 250, but test time didn't change.
>
> >
> > btw, does the test need 64 sockets? is it because of the per socket snd/rcvbuf
> > limitation?
>
> I chose 64 for UDP that does not tune rcvbuf automatically, but I saw an error
> when I increased the number of send() bytes and set SO_RCVBUF anyway,
> so any number of sockets is fine now.
>
> I'll use 2 sockets but will keep for-loops so that we can change NR_SOCKETS
> easily when the test gets flaky.
>
>
> >
> > Another option is to trace SEC("fexit/__sk_destruct") to ensure all the cleanup
> > is done but seems overkill if recv() can do.
>
> Agreed, draining before close() would be enough.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-09-19  2:43 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-17 19:13 [PATCH v9 bpf-next/net 0/6] bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated Kuniyuki Iwashima
2025-09-17 19:13 ` [PATCH v9 bpf-next/net 1/6] tcp: Save lock_sock() for memcg in inet_csk_accept() Kuniyuki Iwashima
2025-09-17 19:13 ` [PATCH v9 bpf-next/net 2/6] net-memcg: Allow decoupling memcg from global protocol memory accounting Kuniyuki Iwashima
2025-09-17 19:13 ` [PATCH v9 bpf-next/net 3/6] net-memcg: Introduce net.core.memcg_exclusive sysctl Kuniyuki Iwashima
2025-09-17 19:14 ` [PATCH v9 bpf-next/net 4/6] bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE Kuniyuki Iwashima
2025-09-17 19:14 ` [PATCH v9 bpf-next/net 5/6] bpf: Introduce SK_BPF_MEMCG_FLAGS and SK_BPF_MEMCG_EXCLUSIVE Kuniyuki Iwashima
2025-09-17 19:14 ` [PATCH v9 bpf-next/net 6/6] selftest: bpf: Add test for SK_MEMCG_EXCLUSIVE Kuniyuki Iwashima
2025-09-17 23:38   ` Martin KaFai Lau
2025-09-18  1:17     ` Kuniyuki Iwashima
2025-09-19  1:14       ` Martin KaFai Lau
2025-09-19  2:28         ` Kuniyuki Iwashima
2025-09-19  2:43           ` Kuniyuki Iwashima
